
    Triton Miners: Competing in the KDD Cup 2011

    May 5, 2011

    Abstract

This status report describes the ideas and experiments that we have performed, or are currently performing, on the KDD Cup 2011 dataset. The dataset is the biggest of its kind and has some unique features: hierarchical relations among items, different types of items, and dates/timestamps for ratings. We have implemented several variants of matrix factorization approaches. Our best submission so far, an Alternating Least Squares model, reaches an RMSE of 25.0426 on the test set, which places us at 92nd position on the leaderboard. After parallelizing the training, one epoch takes roughly 200-400 seconds.


    Contents

1 Introduction
2 Dataset
3 Experiments and Results
   3.1 Notation
   3.2 Biased Regularized Incremental Simultaneous Matrix Factorization (BRISMF)
   3.3 Sigmoid based Matrix Factorization (SMF)
       3.3.1 Adding Temporal Term
   3.4 Sigmoid based Hierarchical Matrix Factorization (SHMF)
   3.5 Alternating least squares based Matrix Factorization (ALS)
       3.5.1 Adding Temporal Term
   3.6 Latent Feature log linear model
   3.7 Neighborhood Based Correction
   3.8 Results
       3.8.1 Timing Information
4 Ideas for further exploration
5 Parallelism
   5.1 Alternating update and grouping strategy
   5.2 Joint SGD Update by grouping strategy
6 Software


    1 Introduction

This report investigates different collaborative filtering methods on the KDD Cup 2011 dataset. The dataset was provided by Yahoo! and was collected from their music service. It is the biggest of its kind, which restricts our choice of algorithms to the ones that scale. Apart from the typical (user, item, rating) triplets, there is hierarchical information among the items (tracks/albums/artists/genres) and there are timestamps, both of which need to be exploited. So far, we have been able to parallelize several variants of matrix factorization approaches and run them in the order of minutes per epoch. We have also analyzed the dataset by item type and found significant overfitting for tracks and albums. Furthermore, on the validation set we found that the majority of the error comes from items that are rated fewer times. The rest of the report contains our current progress and a description of the dataset.

    2 Dataset

The KDD Cup 2011 competition has two tracks. This report presents the experiments we performed on the Track 1 dataset. The statistics for the dataset are presented in Table 1. The ratings range from 0-100 and the dates range roughly over [0-5xxx] days. Session information is also present in the dataset along with the days.

    Figure 1: Dataset format


Table 1. KDD Cup 2011 Track 1 dataset.

#Users                1,000,990
#Items                  624,961
#Ratings            262,810,175
#TrainRatings       252,800,275
#ValidationRatings    4,003,960
#TestRatings          6,005,940

Table 2. Track 1 - hierarchy statistics for items.

#Genres       992
#Artists    27888
#Albums     88909
#Tracks    507172

    Figure 2: Training set rating histogram

    Figure 3: Hierarchical Information


Figure 4: Distribution of ratings: (a) training set, (b) validation set, (c) test set


Figure 5: Rating histograms by item type (log is to the base e): (a) tracks, (b) albums, (c) artists, (d) genres


    3 Experiments and Results

    3.1 Notation

r_{u,i}         True rating for user u and item i
\hat{r}_{u,i}   Predicted rating for user u and item i
U_u             Latent feature vector for user u
I_i             Latent feature vector for item i
k               Size of the feature vectors, i.e., the number of latent factors
U               Concatenated feature matrix for all users
I               Concatenated feature matrix for all items
N_u             Number of users
N_i             Number of items
\eta            Learning rate parameter
\lambda         Regularization parameter
\sigma(x)       Sigmoid of x


b) A regularization term that makes I_i similar for items in the same hierarchy, motivated by the fact that users tend to rate items in the same hierarchy similarly; e.g., the rating for a track and the rating for its corresponding album would be similar. The per-epoch update order is:

for each epoch:
    Update U using I
    Update I_i using U (i in Genres)
    Update I_i using U (i in Artists)
    Update I_i using U (i in Albums), with regularization toward I_{artist(i)}
    Update I_i using U (i in Tracks), with regularization toward I_{album(i)}
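As a concrete illustration of the track pass above, the following is a minimal sketch of one item-side SGD step with the hierarchy pull. It assumes the SMF prediction takes the form 100 * sigmoid(U_u . I_i) and uses illustrative names (gamma for the hierarchy regularization weight); it is a sketch of the idea, not our exact implementation.

// Hypothetical sketch of the item-side SGD step in the SHMF track pass: the
// track vector I_i is updated from its prediction error while being pulled
// toward its album's vector. The prediction 100*sigmoid(U_u . I_i) and all
// identifier names are assumptions made for illustration.
#include <vector>
#include <cmath>

using Vec = std::vector<double>;

double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// One step on (user u, track i, rating r); 'parent' is I_{album(i)}.
void update_track(Vec& Ii, const Vec& Uu, const Vec& parent,
                  double r, double eta, double lambda, double gamma) {
    int k = static_cast<int>(Ii.size());
    double dot = 0.0;
    for (int f = 0; f < k; ++f) dot += Uu[f] * Ii[f];
    double s    = sigmoid(dot);
    double err  = r - 100.0 * s;          // residual under the assumed prediction
    double dact = 100.0 * s * (1.0 - s);  // derivative of the scaled sigmoid
    for (int f = 0; f < k; ++f) {
        double grad = -2.0 * err * dact * Uu[f]     // data term
                    + lambda * Ii[f]                 // usual L2 regularization
                    + gamma * (Ii[f] - parent[f]);   // hierarchy pull toward the album
        Ii[f] -= eta * grad;
    }
}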

    3.5 Alternating least squares based Matrix Factorization (ALS)

This method was first presented in [4]. The main differences compared to the previously discussed methods are a) the update rule for U_u or I_i is the least squares solution, and b) the regularization parameter is multiplied by the number of ratings for that user (n_u) or item (n_i).

Objective Function

E = \sum_{(u,i)} ( r_{u,i} - U_u^T I_i )^2 + \lambda ( \sum_u n_u ||U_u||^2 + \sum_i n_i ||I_i||^2 )

Least squares solution for U_u (and analogously I_i):

( M_{I(u)} M_{I(u)}^T + \lambda n_u E ) U_u = V_u

where M_{I(u)} is the sub-matrix of I whose columns correspond to the items that user u has rated, E is the k x k identity matrix, and V_u = M_{I(u)} R^T(u, I(u)).

Optimization Type   LS

Update Rule

U_u <- A_u^{-1} V_u, where A_u = M_{I(u)} M_{I(u)}^T + \lambda n_u E
I_i <- B_i^{-1} Y_i; the derivation is similar to that for U_u.
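For concreteness, the following is a minimal sketch of the per-user update above, i.e., solving (M_{I(u)} M_{I(u)}^T + lambda n_u E) U_u = V_u with a small dense k x k solve. Data layout and names are illustrative, not our actual code.

// Sketch of one ALS user update: accumulate A = M M^T + lambda*n_u*E and
// b = M r_u over the items the user rated, then solve the k x k system.
#include <vector>
#include <cmath>
#include <utility>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;   // row-major k x k

// Solve A x = b by Gaussian elimination with partial pivoting. The
// lambda*n_u*E term keeps A non-singular whenever the user has ratings.
Vec solve(Mat A, Vec b) {
    int k = static_cast<int>(b.size());
    for (int p = 0; p < k; ++p) {
        int best = p;
        for (int r = p + 1; r < k; ++r)
            if (std::fabs(A[r][p]) > std::fabs(A[best][p])) best = r;
        std::swap(A[p], A[best]);
        std::swap(b[p], b[best]);
        for (int r = p + 1; r < k; ++r) {
            double f = A[r][p] / A[p][p];
            for (int c = p; c < k; ++c) A[r][c] -= f * A[p][c];
            b[r] -= f * b[p];
        }
    }
    Vec x(k);
    for (int p = k - 1; p >= 0; --p) {
        double s = b[p];
        for (int c = p + 1; c < k; ++c) s -= A[p][c] * x[c];
        x[p] = s / A[p][p];
    }
    return x;
}

// One ALS update of U_u given the fixed item matrix I.
Vec update_user(const std::vector<std::pair<int, double>>& ratings, // (item, r_ui)
                const Mat& I,   // I[i] = latent vector of item i, length k
                int k, double lambda) {
    Mat A(k, Vec(k, 0.0));
    Vec b(k, 0.0);
    for (const auto& ir : ratings) {
        const Vec& Ii = I[ir.first];
        for (int a = 0; a < k; ++a) {
            b[a] += Ii[a] * ir.second;                               // M r_u
            for (int c = 0; c < k; ++c) A[a][c] += Ii[a] * Ii[c];    // M M^T
        }
    }
    double nu = static_cast<double>(ratings.size());
    for (int a = 0; a < k; ++a) A[a][a] += lambda * nu;              // + lambda * n_u * E
    return solve(A, b);                                              // = A_u^{-1} V_u
}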

    3.5.1 Adding Temporal Term

The objective function after adding the temporal term is

E = \sum_{(u,i)} ( r_{u,i} - \sum_k U_{uk} I_{ik} T_{tk} )^2 + \lambda ( n_u ||U_u||^2 + n_i ||I_i||^2 ).

The rest of the derivation is similar to the previous section. We first learn the U and I matrices by fixing all elements of T to 1; T is estimated at the end, after U and I have been estimated.
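As an illustration of that final step, one possible closed-form update for the temporal factors with U and I held fixed (a sketch under the assumption that T is also fit by regularized least squares, not necessarily the exact update we run) treats the elementwise product x_{u,i} = U_u \circ I_i as a feature vector and solves, for each day t,

% Sketch: regularized least-squares update for the temporal factor T_t,
% assuming U and I are fixed and D_t is the set of ratings given on day t.
\[
  T_t \;=\; \Big( \sum_{(u,i) \in \mathcal{D}_t} x_{u,i}\, x_{u,i}^{\top} \;+\; \lambda E \Big)^{-1}
            \sum_{(u,i) \in \mathcal{D}_t} x_{u,i}\, r_{u,i},
  \qquad x_{u,i} \;=\; U_u \circ I_i .
\]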

    3.6 Latent Feature log linear model

    In LFL model [1] we restrict output ratings to be in the set Rc = {0,10,20,30,40,50...100} eachcorresponding to c = {0,...11} classes and learn latent features for each of the ratings. We fix U0


and I^0_i to be zero, i.e., we keep class 0 as the base class.

Objective Function

E = \sum_{(u,i)} ( r_{u,i} - \sum_c R_c \exp(U^c_u \cdot I^c_i) / Z )^2 + \lambda \sum_c ( ||U^c_u||^2 + ||I^c_i||^2 )

Z = \sum_c \exp(U^c_u \cdot I^c_i)   (normalization term)

p(c | U^c, I^c) = \exp(U^c_u \cdot I^c_i) / Z

\hat{r} = \sum_c R_c \exp(U^c_u \cdot I^c_i) / Z = \sum_c R_c \, p(c | U^c, I^c)

Derivative with respect to each example, for each class c:

\partial E_{u,i} / \partial U^c_{uk} = -2 ( r_{u,i} - \sum_{c'} R_{c'} p(c' | U, I) ) \, p(c | U, I) \, ( R_c - \sum_{c'} R_{c'} p(c' | U, I) ) \, I^c_{ik}

\partial E_{u,i} / \partial I^c_{ik} = -2 ( r_{u,i} - \sum_{c'} R_{c'} p(c' | U, I) ) \, p(c | U, I) \, ( R_c - \sum_{c'} R_{c'} p(c' | U, I) ) \, U^c_{uk}

Optimization Type   SGD

Update Rule

U^c_{uk} <- U^c_{uk} - \eta ( \partial E / \partial U^c_{uk} + \lambda U^c_{uk} )
I^c_{ik} <- I^c_{ik} - \eta ( \partial E / \partial I^c_{ik} + \lambda I^c_{ik} )
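As an illustration of the prediction rule above, the following sketch computes the softmax over rating classes and the expected rating \hat{r} = \sum_c R_c p(c | U^c, I^c). Names and data layout are illustrative only.

// Hypothetical sketch of the LFL prediction: a softmax over the rating
// classes {0, 10, ..., 100}, followed by the expectation.
#include <vector>
#include <cmath>

using Vec = std::vector<double>;

// U[c] and I[c] are the class-specific latent vectors for this user and
// item; class 0 is the base class, so U[0] and I[0] are kept at zero.
double lfl_predict(const std::vector<Vec>& U, const std::vector<Vec>& I,
                   const std::vector<double>& R /* = {0, 10, ..., 100} */) {
    int C = static_cast<int>(R.size());
    std::vector<double> score(C);
    double Z = 0.0;
    for (int c = 0; c < C; ++c) {
        double dot = 0.0;
        for (std::size_t f = 0; f < U[c].size(); ++f) dot += U[c][f] * I[c][f];
        score[c] = std::exp(dot);
        Z += score[c];
    }
    double rhat = 0.0;
    for (int c = 0; c < C; ++c) rhat += R[c] * score[c] / Z;  // E[R] under p(c|u,i)
    return rhat;
}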

Another scheme to cut down on parameters is to keep a single U_u shared across all rating classes. Although we have implemented this scheme, we have skipped running experiments with it since we did not see a significant difference between the two schemes in the initial runs.

    3.7 Neighborhood Based Correction

We use the method in [3], which is a post-processing step after learning the latent features for users and items. The NB correction is as follows:

r^c_{u,i} = r^p_{u,i} + \alpha \frac{ \sum_{j \ne i} sim(item_i, item_j) ( r_{u,j} - r^p_{u,j} ) }{ \sum_{j \ne i} sim(item_i, item_j) }

where r^c is the corrected rating for user u and item i, r^p is the predicted rating, and sim is the similarity metric; \alpha is learned through regression on the validation set. The summation over j runs over all items the user has rated in the training set.

sim(item_i, item_j) = max\{ 0, \frac{ \sum_k I_{ik} I_{jk} }{ \sqrt{ \sum_k I^2_{ik} } \sqrt{ \sum_k I^2_{jk} } } \}
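A minimal sketch of this correction, assuming the item latent vectors are available in memory and using illustrative names (alpha for the regression coefficient), is given below; it is not our exact implementation.

// Sketch of the neighborhood-based correction: cosine similarity between
// item latent vectors, clipped at zero, then a similarity-weighted average
// of the residuals on the items the user rated in training.
#include <vector>
#include <cmath>
#include <algorithm>

using Vec = std::vector<double>;

double sim(const Vec& a, const Vec& b) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (std::size_t f = 0; f < a.size(); ++f) {
        dot += a[f] * b[f];
        na  += a[f] * a[f];
        nb  += b[f] * b[f];
    }
    double s = dot / (std::sqrt(na) * std::sqrt(nb) + 1e-12);
    return std::max(0.0, s);                // max{0, cosine similarity}
}

// rc_ui = rp_ui + alpha * sum_j sim(i,j)*(r_uj - rp_uj) / sum_j sim(i,j)
double nb_correct(double rp_ui, const Vec& Ii,
                  const std::vector<Vec>& ratedI,   // latent vectors of items j rated by u
                  const std::vector<double>& resid, // r_uj - rp_uj for those items
                  double alpha) {
    double num = 0.0, den = 0.0;
    for (std::size_t j = 0; j < ratedI.size(); ++j) {
        double s = sim(Ii, ratedI[j]);
        num += s * resid[j];
        den += s;
    }
    return den > 0.0 ? rp_ui + alpha * num / den : rp_ui;
}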


    3.8 Results

The figure below shows the RMSE on the train and validation sets for sigmoid matrix factorization.

    Figure 6: SMF: RMSE on train and validation set

The results show overfitting after the second epoch. Item-type-specific RMSE is shown in Figure 7.


Figure 7: SMF: RMSE by item type: (a) tracks, (b) albums, (c) artists, (d) genres


Figure 8: From the ALS run: RMSE vs. log(number of ratings) for (a) tracks, (b) albums, (c) artists, (d) genres. (Note: the red line shows the actual RMSE for that specific item type.)

The results are summarized in the table below. The results from tensor factorization have been excluded since we are not confident about the training scheme. The LFL run was not tuned for the best parameters. Regularization is similar in all the methods except ALS, where it is multiplied by n_u or n_i.


Method (regularization / learning rate / k)   RMSE (test set)   RMSE with NB correction (test set)
BRISMF (.001/.001/100)                        28.6200           -
SMF (10/.0001/100)                            25.6736           25.4884
SHMF (10/.0001/100)                           25.1183           -
ALS (1/-/50)                                  25.0426           -
LFL (10/.0001/100)                            26.5238           -

Table 3. Current results on the test set.

    3.8.1 Timing Information

All these runs used 7 cores on the same node. It takes around 250 seconds to load all the Track 1 files into memory on a single compute node; on vSMP the loading time is around 400 seconds.

Method (k)   Time per epoch (seconds)
SMF (100)    200
ALS (50)     400

Table 4. Time per epoch.

    4 Ideas for further exploration

There are multiple schemes for residual fitting mentioned in [5] which need attention. Another idea is to exploit the hierarchy in the constrained method described in [2], where

U_u = Y_u + \frac{ \sum_i 1_{u,i} W_i }{ \sum_i 1_{u,i} }

and 1_{u,i} is 1 if user u has rated item i in the training set. We see that the NB correction method on SMF does improve the test RMSE, and NB seems to have the same essence as the constrained method, namely that users who have rated items similarly tend to rate items in a similar fashion. Both the NB correction and the constrained method try to make the latent user features of similar users closer. The authors of [2] have noted that the constrained method provides a considerable improvement for users with sparse ratings. We first need to think of a way to parallelize the updates for W_i, and then ponder how to exploit the hierarchy here.

One of the other contestants has claimed an RMSE of 23.97 using ALS with the validation set included in training and 100 latent features [6]. He is currently at rank 47. We can optimistically assume that after adding the constrained feature terms and including the validation set in training we should reach the top 20.


Another scheme to try is to blend different results. Since we are currently aiming at learning more about the dataset and at getting a better RMSE on the validation set with a single method, we feel we should leave blending until the end.

Another note is that the alternating update and grouping strategy is faster, but the RMSE diverges for some initializations of the latent matrices. The results of ALS tensor factorization are similar to ALS without the temporal term on the validation set. One major difference, however, is that the tensor factorization achieves a significantly lower RMSE on the training set (18.xx compared to 20.xx).

    5 Parallelism

    5.1 Alternating update and grouping strategy

In this scheme, the SGD updates for U and I are decoupled: the U matrix is updated while fixing I, and vice versa (alternating). This allows us to exploit the inherent parallelism in the matrix updates. The matrix being updated is split into N groups and each group is updated independently (a sketch follows Figure 9).

    Figure 9: Each of the blocks of User matrix is updated independently.
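The following is a minimal sketch of one user-side pass under this strategy. Our code uses pthreads; std::thread is used here only to keep the sketch short, the grouping is a simple contiguous split, and update_user stands for whichever per-user update (SGD or ALS) is being run; all names are illustrative.

// Sketch of the alternating update: with I fixed, the rows of U can be
// updated independently, so users are split into contiguous groups and
// each group is processed by one thread.
#include <thread>
#include <vector>
#include <functional>
#include <algorithm>

// update_user(u) performs the update of U_u using the fixed item matrix I.
void update_user_block(int begin, int end,
                       const std::function<void(int)>& update_user) {
    for (int u = begin; u < end; ++u) update_user(u);
}

void parallel_user_pass(int num_users, int num_threads,
                        const std::function<void(int)>& update_user) {
    std::vector<std::thread> pool;
    int chunk = (num_users + num_threads - 1) / num_threads;
    for (int t = 0; t < num_threads; ++t) {
        int begin = t * chunk;
        int end   = std::min(num_users, begin + chunk);
        if (begin >= end) break;
        pool.emplace_back(update_user_block, begin, end, std::cref(update_user));
    }
    for (auto& th : pool) th.join();   // barrier before switching to the item pass
}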

    5.2 Joint SGD Update by grouping strategy

    In this scheme, the SGD updates for U and I are parallelized by creating two disjoint set containing

    (u,i) pairs as illustrated in the figure below. This scheme can be recursively applied to each of the

    disjoint set for further levels parallelism. However, since the alternating update strategy seems towork for all the algorithms discussed, this scheme has not been implemented yet.


    Figure 10: Joint SGD update by grouping independent U and I
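Since this scheme is not implemented, the following is only an illustration of the intended scheduling for a two-way split: blocks that share no users and no items are run in parallel, with a barrier between the two phases. All names are hypothetical.

// Illustration of the joint-update grouping: users and items are each
// split into two halves, the (u,i) pairs fall into four blocks, and the
// two blocks of each phase touch disjoint rows of U and columns of I.
#include <thread>
#include <vector>
#include <functional>

struct Rating { int u, i; double r; };

// blocks[a][b] holds the ratings whose user is in half 'a' and whose item
// is in half 'b'; sgd_on_block runs plain SGD over one block.
void joint_update_round(const std::vector<Rating> blocks[2][2],
                        void (*sgd_on_block)(const std::vector<Rating>&)) {
    // Phase 1: (A,X) and (B,Y) are independent.
    {
        std::thread t1(sgd_on_block, std::cref(blocks[0][0]));
        std::thread t2(sgd_on_block, std::cref(blocks[1][1]));
        t1.join();
        t2.join();
    }
    // Phase 2: the remaining pair (A,Y) and (B,X), also independent.
    {
        std::thread t1(sgd_on_block, std::cref(blocks[0][1]));
        std::thread t2(sgd_on_block, std::cref(blocks[1][0]));
        t1.join();
        t2.join();
    }
}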

    6 Software

We initially chose Matlab for implementing the baseline methods, but parallel processing in Matlab turned out to be slower than sequential processing on a single node. We suspect that the slowness is due to some communication overhead in our code which we haven't been able to debug. After spending several days trying to figure out the problem, we gave up and decided to code in C++, which turned out to be a good choice. The pthreads library on GNU/Linux is currently being used for parallelism.

As far as we know, there are no efficient collaborative filtering packages available online. As a byproduct of the competition, we are also trying to build a robust and efficient package along the lines of liblinear for regression.

    References

1. Aditya Krishna Menon, Charles Elkan. A log-linear model with latent features for dyadic prediction. In IEEE International Conference on Data Mining (ICDM), Sydney, Australia, 2010.

2. R. Salakhutdinov and A. Mnih. Probabilistic Matrix Factorization. In Advances in Neural Information Processing Systems. MIT Press, Cambridge, MA, 2008.

3. Gabor Takacs, Istvan Pilaszy, Bottyan Nemeth, Domonkos Tikk. Scalable Collaborative Filtering Approaches for Large Recommender Systems. Journal of Machine Learning Research 10: 623-656 (2009).

4. Zhou, Y., Wilkinson, D.M., Schreiber, R., Pan, R. Large-Scale Parallel Collaborative Filtering for the Netflix Prize. In AAIM (2008) 337-348.


5. A. Paterek. Improving regularized Singular Value Decomposition for collaborative filtering. Proceedings of KDD Cup and Workshop, 2007.

    6. http://groups.google.com/group/graphlab-kdd
