
    Triton Miners: Competing in the KDD Cup 2011

    May 5, 2011

    Abstract

This status report describes the ideas and experiments that we have performed, or are currently performing, on the KDD Cup 2011 dataset. The dataset is the biggest of its kind and has some unique features: hierarchical relations among items, different types of items, and dates/timestamps for ratings. We have implemented several variants of matrix factorization approaches. Our best submission so far, an Alternating Least Squares model, reaches an RMSE of 25.0426 on the test set, which places us at 92nd position on the leaderboard. After parallelizing the training, one epoch takes roughly 200-400 seconds.


    Contents

1 Introduction
2 Dataset
3 Experiments and Results
   3.1 Notation
   3.2 Biased Regularized Incremental Simultaneous Matrix Factorization (BRISMF)
   3.3 Sigmoid based Matrix Factorization (SMF)
       3.3.1 Adding Temporal Term
   3.4 Sigmoid based Hierarchical Matrix Factorization (SHMF)
   3.5 Alternating least squares based Matrix Factorization (ALS)
       3.5.1 Adding Temporal Term
   3.6 Latent Feature log linear model
   3.7 Neighborhood Based Correction
   3.8 Results
       3.8.1 Timing Information
4 Ideas for further exploration
5 Parallelism
   5.1 Alternating update and grouping strategy
   5.2 Joint SGD Update by grouping strategy
6 Software


    1 Introduction

This report investigates different collaborative filtering methods on the KDD Cup 2011 dataset. The dataset was provided by Yahoo! and was collected from their music service. It is the biggest of its kind, which restricts our choice of algorithms to the ones that scale. Apart from the typical (user, item, rating) triplets, there is hierarchical information among the items (tracks/albums/artists/genres) and there are timestamps, both of which need to be exploited. So far, we have been able to parallelize several variants of matrix factorization approaches and run them in the order of minutes per epoch. We have also analyzed the dataset by item type and found significant overfitting for tracks and albums. Furthermore, on the validation set we found that the majority of the error comes from items that are rated fewer times. The rest of the report contains our current progress and a description of the dataset.

    2 Dataset

The KDD Cup 2011 competition has two tracks. This report presents the experiments we performed on the Track 1 dataset. The statistics for the dataset are presented in Table 1. The ratings range from 0-100 and the dates range roughly over [0-5xxx] days. Session information is also present in the dataset along with the days.

    Figure 1: Dataset format


Table 1. KDD Cup 2011 Track 1 dataset.

#Users                1,000,990
#Items                  624,961
#Ratings            262,810,175
#TrainRatings       252,800,275
#ValidationRatings    4,003,960
#TestRatings          6,005,940

Table 2. Track 1 - hierarchy statistics for items.

#Genres       992
#Artists    27888
#Albums     88909
#Tracks    507172

    Figure 2: Training set rating histogram

    Figure 3: Hierarchical Information


Figure 4: Distribution of ratings: (a) training set, (b) validation set, (c) test set


Figure 5: Rating histograms by item type (log is to the base e): (a) tracks, (b) albums, (c) artists, (d) genres


    3 Experiments and Results

    3.1 Notation

r_{u,i}         True rating for user u and item i
\hat{r}_{u,i}   Predicted rating for user u and item i
U_u             Latent feature vector for user u
I_i             Latent feature vector for item i
k               Size of the feature vectors, i.e., the number of latent factors
U               Concatenated feature matrix for all users
I               Concatenated feature matrix for all items
N_u             Number of users
N_i             Number of items
\eta            Learning rate parameter
\lambda         Regularization parameter
\sigma(x)       Sigmoid of x


b) A regularization term that makes I_i similar for items in the same hierarchy, motivated by the fact that users tend to rate items in the same hierarchy similarly; e.g., the rating for a track and the rating for its corresponding album would be similar. The per-epoch update order is:

for each epoch:
    Update U using I
    Update I_i using U (i in Genres)
    Update I_i using U (i in Artists)
    Update I_i using U (i in Albums), with regularization toward I_{artist(i)}
    Update I_i using U (i in Tracks), with regularization toward I_{album(i)}
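As a concrete illustration of the track pass above, the following is a minimal sketch of one item-side SGD step with the hierarchy pull. It assumes the SMF prediction takes the form 100 * sigmoid(U_u . I_i) and uses illustrative names (gamma for the hierarchy regularization weight); it is a sketch of the idea, not our exact implementation.

// Hypothetical sketch of the item-side SGD step in the SHMF track pass: the
// track vector I_i is updated from its prediction error while being pulled
// toward its album's vector. The prediction 100*sigmoid(U_u . I_i) and all
// identifier names are assumptions made for illustration.
#include <vector>
#include <cmath>

using Vec = std::vector<double>;

double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// One step on (user u, track i, rating r); 'parent' is I_{album(i)}.
void update_track(Vec& Ii, const Vec& Uu, const Vec& parent,
                  double r, double eta, double lambda, double gamma) {
    int k = static_cast<int>(Ii.size());
    double dot = 0.0;
    for (int f = 0; f < k; ++f) dot += Uu[f] * Ii[f];
    double s    = sigmoid(dot);
    double err  = r - 100.0 * s;          // residual under the assumed prediction
    double dact = 100.0 * s * (1.0 - s);  // derivative of the scaled sigmoid
    for (int f = 0; f < k; ++f) {
        double grad = -2.0 * err * dact * Uu[f]     // data term
                    + lambda * Ii[f]                 // usual L2 regularization
                    + gamma * (Ii[f] - parent[f]);   // hierarchy pull toward the album
        Ii[f] -= eta * grad;
    }
}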

    3.5 Alternating least squares based Matrix Factorization (ALS)

This method was first presented in [4]. The main differences compared to the previously discussed methods are a) the update rule for U_u or I_i is the least squares solution, and b) the regularization parameter is multiplied by the number of ratings for that user (n_u) or item (n_i).

Objective Function

E = \sum_{(u,i)} ( r_{u,i} - U_u^T I_i )^2 + \lambda ( \sum_u n_u ||U_u||^2 + \sum_i n_i ||I_i||^2 )

Least squares solution for U_u (and analogously I_i):

( M_{I(u)} M_{I(u)}^T + \lambda n_u E ) U_u = V_u

where M_{I(u)} is the sub-matrix of I whose columns correspond to the items that user u has rated, E is the k x k identity matrix, and V_u = M_{I(u)} R^T(u, I(u)).

Optimization Type   LS

Update Rule

U_u <- A_u^{-1} V_u, where A_u = M_{I(u)} M_{I(u)}^T + \lambda n_u E
I_i <- B_i^{-1} Y_i; the derivation is similar to that for U_u.
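For concreteness, the following is a minimal sketch of the per-user update above, i.e., solving (M_{I(u)} M_{I(u)}^T + lambda n_u E) U_u = V_u with a small dense k x k solve. Data layout and names are illustrative, not our actual code.

// Sketch of one ALS user update: accumulate A = M M^T + lambda*n_u*E and
// b = M r_u over the items the user rated, then solve the k x k system.
#include <vector>
#include <cmath>
#include <utility>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;   // row-major k x k

// Solve A x = b by Gaussian elimination with partial pivoting. The
// lambda*n_u*E term keeps A non-singular whenever the user has ratings.
Vec solve(Mat A, Vec b) {
    int k = static_cast<int>(b.size());
    for (int p = 0; p < k; ++p) {
        int best = p;
        for (int r = p + 1; r < k; ++r)
            if (std::fabs(A[r][p]) > std::fabs(A[best][p])) best = r;
        std::swap(A[p], A[best]);
        std::swap(b[p], b[best]);
        for (int r = p + 1; r < k; ++r) {
            double f = A[r][p] / A[p][p];
            for (int c = p; c < k; ++c) A[r][c] -= f * A[p][c];
            b[r] -= f * b[p];
        }
    }
    Vec x(k);
    for (int p = k - 1; p >= 0; --p) {
        double s = b[p];
        for (int c = p + 1; c < k; ++c) s -= A[p][c] * x[c];
        x[p] = s / A[p][p];
    }
    return x;
}

// One ALS update of U_u given the fixed item matrix I.
Vec update_user(const std::vector<std::pair<int, double>>& ratings, // (item, r_ui)
                const Mat& I,   // I[i] = latent vector of item i, length k
                int k, double lambda) {
    Mat A(k, Vec(k, 0.0));
    Vec b(k, 0.0);
    for (const auto& ir : ratings) {
        const Vec& Ii = I[ir.first];
        for (int a = 0; a < k; ++a) {
            b[a] += Ii[a] * ir.second;                               // M r_u
            for (int c = 0; c < k; ++c) A[a][c] += Ii[a] * Ii[c];    // M M^T
        }
    }
    double nu = static_cast<double>(ratings.size());
    for (int a = 0; a < k; ++a) A[a][a] += lambda * nu;              // + lambda * n_u * E
    return solve(A, b);                                              // = A_u^{-1} V_u
}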

    3.5.1 Adding Temporal Term

The objective function after adding the temporal term is

E = \sum_{(u,i)} ( r_{u,i} - \sum_k U_{uk} I_{ik} T_{tk} )^2 + \lambda ( n_u ||U_u||^2 + n_i ||I_i||^2 ).

The rest of the derivation is similar to the previous section. We first learn the U and I matrices by fixing all elements of T to 1; T is estimated at the end, after U and I have been estimated.
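As an illustration of that final step, one possible closed-form update for the temporal factors with U and I held fixed (a sketch under the assumption that T is also fit by regularized least squares, not necessarily the exact update we run) treats the elementwise product x_{u,i} = U_u \circ I_i as a feature vector and solves, for each day t,

% Sketch: regularized least-squares update for the temporal factor T_t,
% assuming U and I are fixed and D_t is the set of ratings given on day t.
\[
  T_t \;=\; \Big( \sum_{(u,i) \in \mathcal{D}_t} x_{u,i}\, x_{u,i}^{\top} \;+\; \lambda E \Big)^{-1}
            \sum_{(u,i) \in \mathcal{D}_t} x_{u,i}\, r_{u,i},
  \qquad x_{u,i} \;=\; U_u \circ I_i .
\]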

    3.6 Latent Feature log linear model

    In LFL model [1] we restrict output ratings to be in the set Rc = {0,10,20,30,40,50...100} eachcorresponding to c = {0,...11} classes and learn latent features for each of the ratings. We fix U0


and I^0_i to be zero, i.e., we keep class 0 as the base class.

Objective Function

E = \sum_{(u,i)} ( r_{u,i} - \sum_c R_c \exp(U^c_u \cdot I^c_i) / Z )^2 + \lambda \sum_c ( ||U^c_u||^2 + ||I^c_i||^2 )

Z = \sum_c \exp(U^c_u \cdot I^c_i)   (normalization term)

p(c | U^c, I^c) = \exp(U^c_u \cdot I^c_i) / Z

\hat{r} = \sum_c R_c \exp(U^c_u \cdot I^c_i) / Z = \sum_c R_c \, p(c | U^c, I^c)

Derivative with respect to each example, for each class c:

\partial E_{u,i} / \partial U^c_{uk} = -2 ( r_{u,i} - \sum_{c'} R_{c'} p(c' | U, I) ) \, p(c | U, I) \, ( R_c - \sum_{c'} R_{c'} p(c' | U, I) ) \, I^c_{ik}

\partial E_{u,i} / \partial I^c_{ik} = -2 ( r_{u,i} - \sum_{c'} R_{c'} p(c' | U, I) ) \, p(c | U, I) \, ( R_c - \sum_{c'} R_{c'} p(c' | U, I) ) \, U^c_{uk}

Optimization Type   SGD

Update Rule

U^c_{uk} <- U^c_{uk} - \eta ( \partial E / \partial U^c_{uk} + \lambda U^c_{uk} )
I^c_{ik} <- I^c_{ik} - \eta ( \partial E / \partial I^c_{ik} + \lambda I^c_{ik} )
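As an illustration of the prediction rule above, the following sketch computes the softmax over rating classes and the expected rating \hat{r} = \sum_c R_c p(c | U^c, I^c). Names and data layout are illustrative only.

// Hypothetical sketch of the LFL prediction: a softmax over the rating
// classes {0, 10, ..., 100}, followed by the expectation.
#include <vector>
#include <cmath>

using Vec = std::vector<double>;

// U[c] and I[c] are the class-specific latent vectors for this user and
// item; class 0 is the base class, so U[0] and I[0] are kept at zero.
double lfl_predict(const std::vector<Vec>& U, const std::vector<Vec>& I,
                   const std::vector<double>& R /* = {0, 10, ..., 100} */) {
    int C = static_cast<int>(R.size());
    std::vector<double> score(C);
    double Z = 0.0;
    for (int c = 0; c < C; ++c) {
        double dot = 0.0;
        for (std::size_t f = 0; f < U[c].size(); ++f) dot += U[c][f] * I[c][f];
        score[c] = std::exp(dot);
        Z += score[c];
    }
    double rhat = 0.0;
    for (int c = 0; c < C; ++c) rhat += R[c] * score[c] / Z;  // E[R] under p(c|u,i)
    return rhat;
}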

Another scheme to cut down on parameters is to keep a single U_u shared across all rating classes. Although we have implemented this scheme, we have skipped running experiments with it since we did not see a significant difference between the two schemes in the initial runs.

    3.7 Neighborhood Based Correction

We use the method in [3], which is a post-processing step after learning the latent features for users and items. The NB correction is as follows:

r^c_{u,i} = r^p_{u,i} + \alpha \frac{ \sum_{j \ne i} sim(item_i, item_j) ( r_{u,j} - r^p_{u,j} ) }{ \sum_{j \ne i} sim(item_i, item_j) }

where r^c is the corrected rating for user u and item i, r^p is the predicted rating, and sim is the similarity metric; \alpha is learned through regression on the validation set. The summation over j runs over all items the user has rated in the training set.

sim(item_i, item_j) = max\{ 0, \frac{ \sum_k I_{ik} I_{jk} }{ \sqrt{ \sum_k I^2_{ik} } \sqrt{ \sum_k I^2_{jk} } } \}
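A minimal sketch of this correction, assuming the item latent vectors are available in memory and using illustrative names (alpha for the regression coefficient), is given below; it is not our exact implementation.

// Sketch of the neighborhood-based correction: cosine similarity between
// item latent vectors, clipped at zero, then a similarity-weighted average
// of the residuals on the items the user rated in training.
#include <vector>
#include <cmath>
#include <algorithm>

using Vec = std::vector<double>;

double sim(const Vec& a, const Vec& b) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (std::size_t f = 0; f < a.size(); ++f) {
        dot += a[f] * b[f];
        na  += a[f] * a[f];
        nb  += b[f] * b[f];
    }
    double s = dot / (std::sqrt(na) * std::sqrt(nb) + 1e-12);
    return std::max(0.0, s);                // max{0, cosine similarity}
}

// rc_ui = rp_ui + alpha * sum_j sim(i,j)*(r_uj - rp_uj) / sum_j sim(i,j)
double nb_correct(double rp_ui, const Vec& Ii,
                  const std::vector<Vec>& ratedI,   // latent vectors of items j rated by u
                  const std::vector<double>& resid, // r_uj - rp_uj for those items
                  double alpha) {
    double num = 0.0, den = 0.0;
    for (std::size_t j = 0; j < ratedI.size(); ++j) {
        double s = sim(Ii, ratedI[j]);
        num += s * resid[j];
        den += s;
    }
    return den > 0.0 ? rp_ui + alpha * num / den : rp_ui;
}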


    3.8 Results

The figure below shows the RMSE on the train and validation sets for sigmoid matrix factorization.

    Figure 6: SMF: RMSE on train and validation set

The results show overfitting after the second epoch. Item-type-specific RMSE is shown in Figure 7.


Figure 7: SMF: RMSE by item type: (a) tracks, (b) albums, (c) artists, (d) genres


Figure 8: From the ALS run: RMSE vs. log(number of ratings) for (a) tracks, (b) albums, (c) artists, (d) genres. (Note: the red line shows the actual RMSE for that specific item type.)

The results are summarized in the table below. The results from tensor factorization have been excluded since we are not confident about the training scheme. The LFL run was not tuned for the best parameters. Regularization is similar in all the methods except ALS, where it is multiplied by n_u or n_i.


Method (regularization / learning rate / k)   RMSE (test set)   RMSE with NB correction (test set)
BRISMF (.001/.001/100)                        28.6200           -
SMF (10/.0001/100)                            25.6736           25.4884
SHMF (10/.0001/100)                           25.1183           -
ALS (1/-/50)                                  25.0426           -
LFL (10/.0001/100)                            26.5238           -

Table 3. Current results on the test set.

    3.8.1 Timing Information

All these runs used 7 cores on the same node. It takes around 250 seconds to load all the Track 1 files into memory on a single compute node; on vSMP the loading time is around 400 seconds.

Method (k)   Time per epoch (seconds)
SMF (100)    200
ALS (50)     400

Table 4. Time per epoch.

    4 Ideas for further exploration

There are multiple schemes for residual fitting mentioned in [5] which need attention. Another idea is to exploit the hierarchy in the constrained method described in [2], where

U_u = Y_u + \frac{ \sum_i 1_{u,i} W_i }{ \sum_i 1_{u,i} }

and 1_{u,i} is 1 if user u has rated item i in the training set. We see that the NB correction method on SMF does improve the test RMSE, and NB seems to have the same essence as the constrained method, namely that users who have rated items similarly tend to rate items in a similar fashion. Both the NB correction and the constrained method try to make the latent user features of similar users closer. The authors of [2] have noted that the constrained method provides a considerable improvement for users with sparse ratings. We first need to think of a way to parallelize the updates for W_i, and then ponder how to exploit the hierarchy here.

One of the other contestants has claimed an RMSE of 23.97 using ALS with the validation set included in training and 100 latent features [6]. He is currently at rank 47. We can optimistically assume that after adding the constrained feature terms and including the validation set in training we should reach the top 20.


Another scheme to try is to blend different results. Since we are currently aiming at learning more about the dataset and at getting a better RMSE on the validation set with a single method, we feel we should leave blending until the end.

Another note is that the alternating update and grouping strategy is faster, but the RMSE diverges for some initializations of the latent matrices. The results of ALS tensor factorization are similar to ALS without the temporal term on the validation set. One major difference, however, is that the tensor factorization achieves a significantly lower RMSE on the training set (18.xx compared to 20.xx).

    5 Parallelism

    5.1 Alternating update and grouping strategy

In this scheme, the SGD updates for U and I are decoupled: the U matrix is updated while fixing I, and vice versa (alternating). This allows us to exploit the inherent parallelism in the matrix updates. The matrix being updated is split into N groups and each group is updated independently (a sketch follows Figure 9).

    Figure 9: Each of the blocks of User matrix is updated independently.
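The following is a minimal sketch of one user-side pass under this strategy. Our code uses pthreads; std::thread is used here only to keep the sketch short, the grouping is a simple contiguous split, and update_user stands for whichever per-user update (SGD or ALS) is being run; all names are illustrative.

// Sketch of the alternating update: with I fixed, the rows of U can be
// updated independently, so users are split into contiguous groups and
// each group is processed by one thread.
#include <thread>
#include <vector>
#include <functional>
#include <algorithm>

// update_user(u) performs the update of U_u using the fixed item matrix I.
void update_user_block(int begin, int end,
                       const std::function<void(int)>& update_user) {
    for (int u = begin; u < end; ++u) update_user(u);
}

void parallel_user_pass(int num_users, int num_threads,
                        const std::function<void(int)>& update_user) {
    std::vector<std::thread> pool;
    int chunk = (num_users + num_threads - 1) / num_threads;
    for (int t = 0; t < num_threads; ++t) {
        int begin = t * chunk;
        int end   = std::min(num_users, begin + chunk);
        if (begin >= end) break;
        pool.emplace_back(update_user_block, begin, end, std::cref(update_user));
    }
    for (auto& th : pool) th.join();   // barrier before switching to the item pass
}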

    5.2 Joint SGD Update by grouping strategy

    In this scheme, the SGD updates for U and I are parallelized by creating two disjoint set containing

    (u,i) pairs as illustrated in the figure below. This scheme can be recursively applied to each of the

    disjoint set for further levels parallelism. However, since the alternating update strategy seems towork for all the algorithms discussed, this scheme has not been implemented yet.


    Figure 10: Joint SGD update by grouping independent U and I
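Since this scheme is not implemented, the following is only an illustration of the intended scheduling for a two-way split: blocks that share no users and no items are run in parallel, with a barrier between the two phases. All names are hypothetical.

// Illustration of the joint-update grouping: users and items are each
// split into two halves, the (u,i) pairs fall into four blocks, and the
// two blocks of each phase touch disjoint rows of U and columns of I.
#include <thread>
#include <vector>
#include <functional>

struct Rating { int u, i; double r; };

// blocks[a][b] holds the ratings whose user is in half 'a' and whose item
// is in half 'b'; sgd_on_block runs plain SGD over one block.
void joint_update_round(const std::vector<Rating> blocks[2][2],
                        void (*sgd_on_block)(const std::vector<Rating>&)) {
    // Phase 1: (A,X) and (B,Y) are independent.
    {
        std::thread t1(sgd_on_block, std::cref(blocks[0][0]));
        std::thread t2(sgd_on_block, std::cref(blocks[1][1]));
        t1.join();
        t2.join();
    }
    // Phase 2: the remaining pair (A,Y) and (B,X), also independent.
    {
        std::thread t1(sgd_on_block, std::cref(blocks[0][1]));
        std::thread t2(sgd_on_block, std::cref(blocks[1][0]));
        t1.join();
        t2.join();
    }
}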

    6 Software

We initially chose Matlab for implementing the baseline methods, but parallel processing in Matlab turned out to be slower than sequential processing on a single node. We suspect that the slowness is due to some communication overhead in our code which we haven't been able to debug. After spending several days trying to figure out the problem, we gave up and decided to code in C++, which turned out to be a good choice. The pthreads library on GNU/Linux is currently being used for parallelism.

As far as we know, there are no efficient collaborative filtering packages available online. As a byproduct of the competition, we are also trying to build a robust and efficient package along the lines of liblinear for regression.

    References

1. Aditya Krishna Menon, Charles Elkan. A log-linear model with latent features for dyadic prediction. In IEEE International Conference on Data Mining (ICDM), Sydney, Australia, 2010.

2. R. Salakhutdinov and A. Mnih. Probabilistic Matrix Factorization. In Advances in Neural Information Processing Systems. MIT Press, Cambridge, MA, 2008.

3. Gabor Takacs, Istvan Pilaszy, Bottyan Nemeth, Domonkos Tikk. Scalable Collaborative Filtering Approaches for Large Recommender Systems. Journal of Machine Learning Research 10: 623-656 (2009).

4. Zhou, Y., Wilkinson, D.M., Schreiber, R., Pan, R. Large-Scale Parallel Collaborative Filtering for the Netflix Prize. In AAIM (2008) 337-348.


5. A. Paterek. Improving regularized Singular Value Decomposition for collaborative filtering. Proceedings of KDD Cup and Workshop, 2007.

    6. http://groups.google.com/group/graphlab-kdd
