WhoToFollow @Spotify
Rohan Agrawal
Fall 2013 Internship @ Spotify
WhoToFollow Recommendations
User Recommendation Problem
● First step: candidate set generation
● Second step: rank the candidates using a supervised ML model
● Problem?
● Need to generate training data for the ML model
● Generate candidates (2 hop) for users in an old social graph, say from 1 month before
● Look at the current social graph: if a link was established between user and candidate in the current graph, treat the edge as a positive class.
● If a link was not established, treat the edge as a negative class.
● Not the best way to get training data, as the edges actually formed depend on the previous recommendation algorithm, but a good start.
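The labeling scheme above can be sketched roughly as follows. This is a minimal illustration, not the code used in the internship: the function name `label_candidates`, the dict-of-sets graph representation, and the toy user IDs are all assumptions made for the example.

```python
def label_candidates(old_follows, new_follows, candidates):
    """Label (user, candidate) pairs: 1 if the follow edge exists in the
    current graph, 0 otherwise. Graphs are dicts mapping a user to the set
    of users they follow (an assumed representation)."""
    labeled = []
    for user, cands in candidates.items():
        already = old_follows.get(user, set())
        current = new_follows.get(user, set())
        for cand in cands:
            if cand in already:
                continue  # edge already existed in the old snapshot, not a new link
            labeled.append((user, cand, 1 if cand in current else 0))
    return labeled


# Toy snapshots: "a" started following "c" between the two graphs.
old = {"a": {"b"}, "b": {"c", "d"}}
new = {"a": {"b", "c"}, "b": {"c", "d"}}
cands = {"a": ["c", "d"]}  # 2-hop candidates computed on the old graph
examples = label_candidates(old, new, cands)
```

Here `examples` holds one positive pair ("a", "c") and one negative pair ("a", "d").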
Candidate Set Generation
Which users do you want to consider for WTF recs?
● Simple approach: all users at 2 hops are candidates (ranked by the number of 2-hop paths; just take the top 200)
● Complex approaches
● Use personalized PageRank or SALSA to find candidates for each user.
● Use user interaction to get a weighted social graph, then perform the above techniques.
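A minimal sketch of the simple 2-hop approach, assuming the graph is held as a dict mapping each user to the set of users they follow. The function name `two_hop_candidates` and the tie-breaking by user ID are assumptions for illustration; only the cutoff of 200 comes from the slide.

```python
from collections import Counter


def two_hop_candidates(follows, user, top_k=200):
    """Return up to top_k (candidate, path_count) pairs reachable in exactly
    two follow hops, ranked by the number of distinct 2-hop paths."""
    direct = follows.get(user, set())
    paths = Counter()
    for mid in direct:
        for cand in follows.get(mid, set()):
            # Skip the user themselves and anyone they already follow.
            if cand != user and cand not in direct:
                paths[cand] += 1
    # Sort by path count (descending), then by ID for a deterministic order.
    ranked = sorted(paths.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


follows = {"a": {"b", "c"}, "b": {"d", "e"}, "c": {"d", "a"}}
result = two_hop_candidates(follows, "a")
```

In this toy graph "d" is reachable through both "b" and "c", so it ranks ahead of "e".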
Many users (around 50%) do not have a 2-hop neighborhood
● Use Facebook friends as candidates (only 16% of users don’t have FB candidates, and 5% of users have neither FB candidates nor 2-hop neighbors)
● Use Approximate Nearest Neighbors
Extracting Features
● hops: number of paths of length 2 between user1 and user2
● hopslog: hops / log(# of subscribers user2 has)
● common: number of common neighbors shared by user1 and user2
● jaccard: common / (size of the union of the neighbors of user1 and user2)
● cosine: cosine similarity of the user vectors of user1 and user2
● adamic: summation over neighbors of user1 [1 / log(# of subscribers of the neighbor)]
● indegree: in-degree of user2
● fraction_n2: for 2 users i and j, fraction of subscriptions of i that are following j
● fraction_n1: for 2 users i and j, fraction of subscriptions of j that i follows
● pref_attachment: number of subscriptions of i * number of followers of j
● reverse_edge: for i, j = 1 if j follows i
● Label: positive or negative class, as described in slide 2.
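A few of these features can be sketched as below. The slides do not say whether "neighbors" means followees, followers, or both, so this example assumes "neighbors" are the users someone follows; the helper names `invert` and `pair_features` and the toy graph are hypothetical.

```python
def invert(follows):
    """Build a follower map (user -> set of users following them)."""
    followers = {}
    for src, dsts in follows.items():
        for dst in dsts:
            followers.setdefault(dst, set()).add(src)
    return followers


def pair_features(follows, followers, i, j):
    """Compute a subset of the listed features for the ordered pair (i, j)."""
    out_i = follows.get(i, set())    # subscriptions of i
    out_j = follows.get(j, set())    # subscriptions of j
    in_j = followers.get(j, set())   # followers of j
    hops = len(out_i & in_j)         # paths i -> k -> j of length 2
    common = len(out_i & out_j)
    union = len(out_i | out_j)
    return {
        "hops": hops,
        "common": common,
        "jaccard": common / union if union else 0.0,
        "indegree": len(in_j),
        "fraction_n2": hops / len(out_i) if out_i else 0.0,
        "pref_attachment": len(out_i) * len(in_j),
        "reverse_edge": 1 if i in out_j else 0,
    }


follows = {"i": {"k1", "k2"}, "k1": {"j"}, "k2": {"j"}, "j": {"i", "k1"}}
feats = pair_features(follows, invert(follows), "i", "j")
```

In the toy graph both of i's subscriptions follow j, so `hops` is 2 and `fraction_n2` is 1.0.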
Ranking Features by Importance
● 0.185521009562 hops
● 0.151976624315 fraction_n2
● 0.126571252655 fraction_n1
● 0.126321244854 cosine
● 0.0828860325682 pref_attachment
● 0.0709010797719 indegree_j
● 0.0660478462424 hopslog
● 0.0649419577136 adamic
● 0.0531705297389 common
● 0.0372079185808 jaccard
● 0.0344545039974 reverse_edge
As given by Gradient Boosted Regression Trees. Treat this ranking only as an indication: features such as fraction_n2, fraction_n1, and jaccard depend on each other, while features such as cosine similarity don’t depend on other features.
Extracting Features
● More features that can be considered in the future:
● Facebook friend Boolean, PageRank score, geographic distance, age difference, …
Machine Learning Models
● Tried Logistic Regression, SVM, and Random Forests; in the end, Gradient Boosted Decision Trees gave the best performance (68 - 69%).
● However, the learned model depends on the module currently serving WTF recs.
● When pushed to production, the model can learn from a better training set.
Results from testing with Spotify Employees
● Total records: 1251
● Yes / Total = 22.14%
● Yes and I know the recommendation / Total responses where users knew their recommendation = 61.11%
● Yes and I like the person’s musical taste / Total responses where users liked their recommendation’s taste = 61.36%
● Yes, I like and know the recommended user / Total people who liked and knew their recommendations = 78.57%
● Yes, I like the user’s taste but I don’t know the user / Total people who liked the taste and didn’t know their recommendations = 35.7%
● Yes, I know the user but dislike the user’s taste / Total people who disliked the taste and knew their recommendations = 17.8%
Optimizations:
● At first I converted each userID into an integer, loaded the entire dataset into memory, and then did the computation.
● This was very difficult to convert to multiprocessing code: each process tried to make a copy of the graph, which was not possible, and creating a shared object was very slow.
● The best option was to use a database, because only retrieval was needed.
● Sparkey was preferred to Tokyo Cabinet because the time to construct the index was much lower.
● 1 process: very, very slow, 10 users per second
● Bound by the call to the OpenGraph API for Spotify users’ FB friends
● 100 processes: 92.6 users per second, 1 million users in 180 minutes
● 150 processes: 116.7 users per second, 1.8 million users in 257 minutes
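As a quick arithmetic check, the reported run times follow directly from the throughput figures. The helper `minutes_needed` is a hypothetical name introduced only for this example; the user counts and rates are the ones quoted above.

```python
def minutes_needed(num_users, users_per_second):
    """Wall-clock minutes to process num_users at a steady throughput."""
    return num_users / users_per_second / 60.0


run_100 = minutes_needed(1_000_000, 92.6)    # 100 processes
run_150 = minutes_needed(1_800_000, 116.7)   # 150 processes
run_1 = minutes_needed(1_000_000, 10)        # single process, for comparison
```

At the single-process rate of 10 users per second, one million users would take about 1,667 minutes (roughly 28 hours), which is why the parallel setup mattered.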
Resources
● Seminal paper by Kleinberg http://www.cs.cornell.edu/home/kleinber/link-pred.pdf
● Supervised Learning http://www3.nd.edu/~dial/papers/KDD10.pdf
● Twitter http://www.stanford.edu/~rezab/papers/wtf_overview.pdf
● Twitter’s WTF problem is pretty similar to ours, asymmetric follows
● Future:
● Supervised Random Walks http://cs.stanford.edu/people/jure/pubs/linkpred-wsdm11.pdf
● Large Scale Twitter http://www.umiacs.umd.edu/~jimmylin/publications/Lin_Kolcz_SIGMOD2012.pdf
● Fast Page Rank http://arxiv.org/abs/1006.2880
Questions?
Thank YOU!