Music recommendations at Spotify
Transcript of Music recommendations at Spotify
Music recommendations at Spotify
Erik Bernhardsson
Spotify
- Launched in 2009
- Available in 17 countries
- 20M active users, 5M paying subscribers
- Peak at 5k tracks/s, 1M logged in users
- 20M tracks
Some applications
Recommendation stuff at Spotify
- Related artists:
Recommendation stuff at Spotify, cont…
More!
How can we find music?
Recommendations
- Manual classification
- Feature extraction
- Social media analysis, web scraping, metadata based
- Collaborative filtering
Pandora & Music Genome Project
- Classifies tracks in terms of 400 attributes
- Each track takes 20-30 minutes to classify
- A distance function finds similar tracks
- “Subtle use of strings”
- “Epic buildup”
- “Acid Jazz roots”
- “Beats made for dancing”
- “Trippy soundscapes”
- “Great trombone solo”
- …
Scraping the web is another approach
Feature extraction
Collaborative filtering
Idea:
- If two movies x, y get similar ratings then they are probably similar
- If a lot of users all listen to tracks x, y, z, then those tracks are probably similar
Collaborative filtering
Get data
… lots of data
Aggregate data
Throw away temporal information and just look at the number of times
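The aggregation step can be sketched as follows. This is a minimal illustration, assuming the raw data is a log of (user, track, timestamp) events; the log format, the names `play_log` and `aggregate_play_counts`, and the sample values are all hypothetical and not taken from the talk.

```python
from collections import Counter

# Hypothetical play log: (user, track, timestamp) tuples.
play_log = [
    ("alice", "track_a", 1001),
    ("alice", "track_a", 1050),
    ("alice", "track_b", 1100),
    ("bob", "track_a", 1002),
    ("bob", "track_c", 1200),
    ("bob", "track_c", 1300),
    ("bob", "track_c", 1400),
]


def aggregate_play_counts(log):
    """Drop the timestamps and count plays per (user, track) pair."""
    return Counter((user, track) for user, track, _timestamp in log)


# Sparse representation of the user-by-track count matrix:
# only pairs with at least one play are stored.
play_counts = aggregate_play_counts(play_log)
```

Storing only the nonzero entries matters here because almost every user has played only a tiny fraction of the catalogue.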
OK, so now we have a big matrix
… very big matrix
Throw out all the temporal data:
Supervised collaborative filtering is pretty much matrix completion
Supervised learning: Matrix completion
Supervised: evaluating rec quality
Unsupervised learning
- Trying to estimate the density
- i.e. predict probability of future events
Try to predict the future given the past
How can we find similar items?
We can calculate a correlation coefficient as an item similarity
- Use something like Pearson, Jaccard, …
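Both measures can be computed directly from play counts. The sketch below assumes a small in-memory table `counts[track][user]` (a hypothetical layout with made-up values); Jaccard compares the sets of listeners, while Pearson compares the count vectors over a fixed list of users.

```python
import math

# Hypothetical play counts: counts[track][user] = number of plays.
counts = {
    "track_a": {"alice": 5, "bob": 1, "carol": 3},
    "track_b": {"alice": 4, "bob": 0, "carol": 2},
    "track_c": {"bob": 7},
}
users = ["alice", "bob", "carol"]


def jaccard(x, y):
    """Overlap of the listener sets: |X intersect Y| / |X union Y|."""
    listeners_x = {u for u, n in counts[x].items() if n > 0}
    listeners_y = {u for u, n in counts[y].items() if n > 0}
    union = listeners_x | listeners_y
    return len(listeners_x & listeners_y) / len(union) if union else 0.0


def pearson(x, y):
    """Pearson correlation of the two tracks' play-count vectors."""
    xs = [counts[x].get(u, 0) for u in users]
    ys = [counts[y].get(u, 0) for u in users]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant vector
    return cov / math.sqrt(var_x * var_y)
```

Computing this for every pair of items is quadratic in the number of items, which is the scaling problem the following slides refer to.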
Amazon did this for “customers who bought this also bought”
- US patent 7113917
Parallelization is hard though
Can speed this up using various LSH tricks
- Twitter: Dimension Independent Similarity Computation (DISCO)
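As one illustration of the LSH idea, here is a minimal MinHash sketch, which approximates Jaccard similarity from short signatures. This is a generic technique and not the DISCO algorithm; the hash family, the signature length, and the integer user ids are assumptions made for the example, and the banding step that buckets similar signatures together is omitted.

```python
import random

PRIME = 2_147_483_647  # 2^31 - 1, larger than any user id used below


def make_hash_functions(num_hashes, seed=0):
    """Random linear hash functions h(u) = (a * u + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(user_ids, hash_functions):
    """For each hash function, keep the smallest hash over the set."""
    return [min((a * u + b) % PRIME for u in user_ids)
            for a, b in hash_functions]


def estimate_jaccard(sig_x, sig_y):
    """The fraction of matching positions estimates the Jaccard similarity."""
    matches = sum(1 for p, q in zip(sig_x, sig_y) if p == q)
    return matches / len(sig_x)


hash_functions = make_hash_functions(num_hashes=128)
listeners_a = set(range(0, 100))    # hypothetical listener ids for track a
listeners_b = set(range(50, 150))   # overlaps with half of track a's listeners
sig_a = minhash_signature(listeners_a, hash_functions)
sig_b = minhash_signature(listeners_b, hash_functions)
estimate = estimate_jaccard(sig_a, sig_b)
```

Each track is reduced to a fixed-length signature, so comparing two tracks costs the same regardless of how many listeners they have.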
Are there other approaches?
Natural Language Processing has a lot of similar problems
…matrix factorization is one idea
Matrix factorization
Matrix factorization
- Want to get user vectors and item vectors
- Assume f latent factors (dimensions) for each user/item
- Hofmann, 1999
- Also called PLSI
Probabilistic Latent Semantic Analysis (PLSA)
PLSA, cont.
+ a bunch of constraints:
PLSA, cont.
Optimization problem: maximize log-likelihood
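The slide formulas are not preserved in this transcript. For reference, the standard symmetric PLSA formulation is given below, written for users u, items i, and latent classes z; the notation, and the use of n(u, i) for the number of times user u played item i, are assumptions and may differ from the original slides.

```latex
\begin{align}
P(u, i) &= \sum_{z} P(z)\, P(u \mid z)\, P(i \mid z) \\
\text{subject to}\quad
  &\sum_{z} P(z) = 1, \quad
   \sum_{u} P(u \mid z) = 1, \quad
   \sum_{i} P(i \mid z) = 1, \quad
   \text{all probabilities} \ge 0 \\
\max \; \mathcal{L} &= \sum_{u, i} n(u, i)\, \log P(u, i)
\end{align}
```

The number of latent classes z plays the role of the f latent factors.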
PLSA, cont.
“Collaborative Filtering for Implicit Feedback Datasets”
- Hu, Koren, Volinsky (2008)
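The objective from that paper, in the paper's own notation, is reproduced here because the slide formula is not preserved in the transcript. With r_{ui} the observed count for user u and item i, x_u and y_i the user and item vectors, and α and λ tuning parameters:

```latex
\begin{align}
p_{ui} &= \begin{cases} 1 & r_{ui} > 0 \\ 0 & r_{ui} = 0 \end{cases}
\qquad
c_{ui} = 1 + \alpha\, r_{ui} \\
\min_{x, y} \;& \sum_{u, i} c_{ui} \left( p_{ui} - x_u^{\top} y_i \right)^2
  + \lambda \left( \sum_{u} \lVert x_u \rVert^2 + \sum_{i} \lVert y_i \rVert^2 \right)
\end{align}
```

The sum runs over all user-item pairs, including unobserved ones, and the paper minimizes it with alternating least squares.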
“Collaborative Filtering for Implicit Feedback Datasets”, cont.
Here is another method we use
What happens each iteration
- Assign all latent vectors small random values
- Perform gradient ascent to optimize log-likelihood
Calculate derivative and do gradient ascent
- Assign all latent vectors small random values
- Perform gradient ascent to optimize log-likelihood
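The two steps above can be sketched in a few lines. The transcript does not preserve the model itself, so this example assumes a softmax model, P(i | u) proportional to exp(a_u · b_i), purely for illustration; the toy count matrix, the learning rate, and all function names are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def item_probabilities(user_vec, item_vecs):
    """Softmax over items: P(i | u) is proportional to exp(a_u . b_i)."""
    scores = [dot(user_vec, b) for b in item_vecs]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def log_likelihood(counts, user_vecs, item_vecs):
    ll = 0.0
    for u, row in enumerate(counts):
        probs = item_probabilities(user_vecs[u], item_vecs)
        ll += sum(n * math.log(p) for n, p in zip(row, probs) if n > 0)
    return ll


def gradient_ascent_step(counts, user_vecs, item_vecs, learning_rate):
    f = len(user_vecs[0])
    user_grads = [[0.0] * f for _ in user_vecs]
    item_grads = [[0.0] * f for _ in item_vecs]
    for u, row in enumerate(counts):
        probs = item_probabilities(user_vecs[u], item_vecs)
        total_plays = sum(row)
        for i, (n, p) in enumerate(zip(row, probs)):
            residual = n - total_plays * p  # observed minus expected plays
            for k in range(f):
                user_grads[u][k] += residual * item_vecs[i][k]
                item_grads[i][k] += residual * user_vecs[u][k]
    # Move every vector a small step along its gradient.
    for vecs, grads in ((user_vecs, user_grads), (item_vecs, item_grads)):
        for vec, grad in zip(vecs, grads):
            for k in range(f):
                vec[k] += learning_rate * grad[k]


def train(counts, num_factors=2, iterations=100, learning_rate=0.01, seed=0):
    rng = random.Random(seed)

    def small_random_vector():
        return [rng.uniform(-0.01, 0.01) for _ in range(num_factors)]

    user_vecs = [small_random_vector() for _ in counts]
    item_vecs = [small_random_vector() for _ in counts[0]]
    history = [log_likelihood(counts, user_vecs, item_vecs)]
    for _ in range(iterations):
        gradient_ascent_step(counts, user_vecs, item_vecs, learning_rate)
        history.append(log_likelihood(counts, user_vecs, item_vecs))
    return user_vecs, item_vecs, history


# Toy count matrix: rows are users, columns are tracks.
counts = [
    [5, 3, 0, 0],
    [4, 2, 0, 1],
    [0, 0, 6, 4],
    [0, 1, 5, 3],
]
user_vecs, item_vecs, history = train(counts)
```

`history` records the log-likelihood after each iteration, which is a convenient way to check that the step size is small enough for the objective to keep improving.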
2D iteration example
Vectors are pretty nice because things are now super fast
- User-item score is a dot product:
- Item-item similarity score is a cosine similarity:
- Both cases have trivial complexity in the number of factors f:
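A minimal sketch of both scores follows, using plain Python lists as vectors. The example vectors are made up; each function makes a constant number of passes over the f components, so the cost is O(f).

```python
import math


def dot(a, b):
    """User-item score: sum of component-wise products."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Item-item similarity: dot product of the normalized vectors."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0


# Hypothetical two-factor vectors.
user_vec = [0.5, 1.0]
track_x = [1.0, 2.0]
track_y = [2.0, 4.0]   # same direction as track_x
track_z = [2.0, -1.0]  # orthogonal to track_x

score = dot(user_vec, track_x)
sim_xy = cosine(track_x, track_y)  # close to 1: the vectors are parallel
sim_xz = cosine(track_x, track_z)  # 0: the vectors are orthogonal
```

Cosine ignores vector length, so a very popular track and an obscure one can still be close neighbours if they point in the same direction.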
Example: item similarity as a cosine of vectors
Two dimensional example for tracks
We can rank all tracks by the user’s vector
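Ranking amounts to scoring every track against the user's vector and keeping the best few. The sketch below assumes a small in-memory dictionary of track vectors; the names and values are hypothetical, and a catalogue of millions of tracks would need something more scalable than this linear scan.

```python
import heapq


def top_tracks(user_vec, track_vecs, k):
    """Return the names of the k tracks with the highest dot-product score."""
    def score(item):
        _name, vec = item
        return sum(x * y for x, y in zip(user_vec, vec))

    best = heapq.nlargest(k, track_vecs.items(), key=score)
    return [name for name, _vec in best]


# Hypothetical two-factor track vectors.
track_vecs = {
    "track_a": [1.0, 0.0],
    "track_b": [0.0, 1.0],
    "track_c": [0.7, 0.7],
    "track_d": [-1.0, 0.2],
}
user_vec = [0.9, 0.3]
recommended = top_tracks(user_vec, track_vecs, k=2)
```

`heapq.nlargest` avoids sorting the whole catalogue when only the top k results are needed.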
So how do we implement this?
Hadoop at Spotify
One iteration of a matrix factorization algorithm
“Google News personalization: scalable online collaborative filtering”
So now we've solved the problem of recommendations, right?
Actually what we really want is to apply it to other domains
Radio
- Artist radio: find related tracks
- Optimize ensemble model based on skip/thumbs data
Learning from feedback is actually pretty hard
A/B testing
More applications!!!
Last but not least: we’re hiring!
Thank you