WhoToFollow @Spotify

Rohan Agrawal
Fall 2013 Internship @ Spotify

WhoToFollow Recommendations

User Recommendation Problem

● First step: Candidate set generation
● Second step: Rank candidates using a supervised ML model
● Problem?
● Need to generate training data for the ML model
● Generate candidates (2 hop) for users in an old social graph, say from 1 month before
● Look at the current social graph: if a link was established between user and candidate in the current graph, treat the edge as a positive class
● If a link was not established, treat the edge as a negative class
● Not the best way to get training data, as the edges actually formed depend on the previous recommendation algorithm, but a good start
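
The labeling step above can be sketched as follows. This is a minimal illustration, not the code used in the project: the function name `label_candidates` and the dict-of-sets graph representation are assumptions made for the example.

```python
def label_candidates(candidates, current_follows):
    """Label (user, candidate) pairs using the current social graph.

    candidates: dict mapping a user to the candidates generated for that
        user from the OLD graph snapshot (hypothetical input format).
    current_follows: dict mapping a user to the set of users they follow
        in the CURRENT graph.
    Returns a list of (user, candidate, label) rows, where label is 1 if
    the link exists in the current graph and 0 otherwise.
    """
    rows = []
    for user, cands in candidates.items():
        now_following = current_follows.get(user, set())
        for cand in cands:
            rows.append((user, cand, 1 if cand in now_following else 0))
    return rows


if __name__ == "__main__":
    old_candidates = {"a": ["c", "d"]}
    current = {"a": {"b", "c"}}
    for row in label_candidates(old_candidates, current):
        print(row)
```

Candidates that were never followed become negative examples, so the resulting set is biased toward whatever the previous recommender surfaced, as noted above.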

Candidate Set Generation

Which users do you want to consider for WTF recs?
● Simple approach: all users at 2 hops are candidates (ranked by the total number of hops, just take the top 200)
● Complex approaches
  ● Use personalized PageRank or SALSA to find candidates for each user.
  ● Use user interaction to get a weighted social graph, then perform the above techniques.

Many users (around 50%) do not have a 2-hop neighborhood
● Use Facebook friends as candidates (only 16% of users don't have FB candidates, and 5% of users have neither FB candidates nor 2-hop neighbors)
● Use Approximate Nearest Neighbors
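
A minimal sketch of the simple 2-hop approach, assuming the follow graph fits in a dict of sets. The name `two_hop_candidates` and the tie-breaking by user ID are illustrative choices, not details from the talk.

```python
from collections import Counter


def two_hop_candidates(follows, user, top_n=200):
    """Return up to top_n users reachable in exactly 2 hops from `user`.

    follows: dict mapping a user to the set of users they follow.
    Candidates are ranked by the number of distinct 2-hop paths that
    reach them; ties are broken by user ID to keep output deterministic.
    """
    direct = follows.get(user, set())
    path_counts = Counter()
    for mid in direct:
        for cand in follows.get(mid, set()):
            # Skip the user themselves and anyone they already follow.
            if cand != user and cand not in direct:
                path_counts[cand] += 1
    ranked = sorted(path_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [cand for cand, _ in ranked[:top_n]]


if __name__ == "__main__":
    graph = {"a": {"b", "c"}, "b": {"d", "e"}, "c": {"d", "a"}}
    print(two_hop_candidates(graph, "a"))
```

A user who follows nobody gets an empty list here, which is the case where Facebook friends or nearest neighbors would have to supply the candidates.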

Extracting Features

● hops: number of paths of length 2 between user1 and user2
● hopslog: hops/log(# of subscribers user2 has)
● common: no. of common neighbors shared by user1 and user2
● jaccard: common/(union of neighbors of user1 and user2)
● cosine: cosine similarity of user vectors of user1 and user2
● adamic: summation over neighbors of user1 [1/log(# of subscribers of the neighbor)]
● indegree: in degree of user2
● fraction_n2: for 2 users i and j, fraction of subscriptions of i that are following j
● fraction_n1: for 2 users i and j, fraction of subscriptions of j that i follows
● pref_attachment: number of subscriptions of i * number of followers of j
● reverse_edge: of i,j = 1 if j follows i
● Label: positive or negative class, as described in slide 2.
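
A few of these features can be sketched on a directed follow graph as below. This is an illustration under stated assumptions: "neighbors" is taken to mean the users someone follows, "hops" counts paths i → w → j, and the helper names `invert` and `pair_features` are hypothetical. The talk does not spell out these conventions.

```python
def invert(follows):
    """Build a followers map (user -> set of users following them)."""
    followers = {}
    for src, targets in follows.items():
        for dst in targets:
            followers.setdefault(dst, set()).add(src)
    return followers


def pair_features(follows, followers, i, j):
    """Compute a subset of the listed features for the pair (i, j)."""
    out_i = follows.get(i, set())    # users i follows
    out_j = follows.get(j, set())    # users j follows
    in_j = followers.get(j, set())   # users following j
    common = len(out_i & out_j)
    union = len(out_i | out_j)
    return {
        # w such that i follows w and w follows j
        "hops": len(out_i & in_j),
        "common": common,
        "jaccard": common / union if union else 0.0,
        "indegree": len(in_j),
        "pref_attachment": len(out_i) * len(in_j),
        "reverse_edge": 1 if i in out_j else 0,
    }


if __name__ == "__main__":
    follows = {"a": {"b", "c"}, "b": {"d"}, "c": {"d"}, "d": {"a", "b"}}
    print(pair_features(follows, invert(follows), "a", "d"))
```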

Ranking Features by Importance

● 0.185521009562 hops
● 0.151976624315 fraction_n2
● 0.126571252655 fraction_n1
● 0.126321244854 cosine
● 0.0828860325682 pref_attachment
● 0.0709010797719 indegree_j
● 0.0660478462424 hopslog
● 0.0649419577136 adamic
● 0.0531705297389 common
● 0.0372079185808 jaccard
● 0.0344545039974 reverse_edge

As given by Gradient Boosted Regression Trees. This ranking should be read only as an indication: many features, such as fraction_n2, fraction_n1 and jaccard, depend on each other, while features like cosine similarity don't depend on other features.

Extracting Features

● More features that can be considered in the future:
  ● Facebook friend Boolean, PageRank score, Geographic Distance, Age Difference, …

Machine Learning Models

● Tried Logistic Regression, SVM and Random Forests; in the end, Gradient Boosted Decision Trees gave the best performance (68 - 69%).

● However, the learned model depends on the current module that is serving WTF recs.

● Once pushed to production, the model can learn from a better training set.

Results from testing with Spotify Employees

● Total Records: 1251
● Yes / Total = 22.14%
● Yes and I know the recommendation / Total responses where users knew their recommendation = 61.11%
● Yes and I like the person's musical taste / Total responses where users liked their recommendation's taste = 61.36%
● Yes, I like and know the recommended user / Total people who liked and knew their recommendations = 78.57%
● Yes, I like the user's taste but I don't know the user / Total people who liked the taste and didn't know their recommendations = 35.7%
● Yes, I know the user but dislike the user's taste / Total people who disliked the taste and knew their recommendations = 17.8%

Optimizations:

● At first, I converted each userID into an integer, loaded the entire dataset into memory, and then ran the computation.
● This was very difficult to convert to multiprocessing code: each process tried to make its own copy of the graph, which was not possible, and creating a shared object was very slow.
● The best option was to use a database, because only retrieval was needed.
● Sparkey was preferred to Tokyo Cabinet, because the time to construct the index was much lower.
● 1 process: very slow, 10 users per second
  ● Bound by calls to the OpenGraph API for Spotify users' FB friends
● 100 processes: 92.6 users per second, 1 million users in 180 minutes
● 150 processes: 116.7 users per second, 1.8 million users in 257 minutes
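
The throughput figures above follow directly from the user counts and runtimes. A small check, with `users_per_second` as a helper name introduced only for this example:

```python
def users_per_second(total_users, minutes):
    """Average throughput for a run that processed total_users in `minutes`."""
    return total_users / (minutes * 60)


if __name__ == "__main__":
    # (processes, users processed, runtime in minutes) as quoted above
    runs = [(100, 1_000_000, 180), (150, 1_800_000, 257)]
    for processes, users, minutes in runs:
        rate = users_per_second(users, minutes)
        print(f"{processes} processes: {rate:.1f} users per second")
```

Both computed rates agree with the quoted 92.6 and 116.7 users per second. Going from 100 to 150 processes raised throughput by roughly a quarter, not by half.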

Resources

● Seminal paper by Kleinberg: http://www.cs.cornell.edu/home/kleinber/link-pred.pdf
● Supervised Learning: http://www3.nd.edu/~dial/papers/KDD10.pdf
● Twitter: http://www.stanford.edu/~rezab/papers/wtf_overview.pdf
  ● Twitter's WTF problem is pretty similar to ours (asymmetric follows)
● Future:
  ● Supervised Random Walks: http://cs.stanford.edu/people/jure/pubs/linkpred-wsdm11.pdf
  ● Large Scale Twitter: http://www.umiacs.umd.edu/~jimmylin/publications/Lin_Kolcz_SIGMOD2012.pdf
  ● Fast PageRank: http://arxiv.org/abs/1006.2880


Questions?

Thank YOU!
