Using the search engine as recommendation engine
Recommendations from the search engine
Sesam Hackathon, Warsaw, 2014-03-23
Lars Marius Garshol, [email protected], http://twitter.com/larsga
This whole presentation is about Ted Dunning's proposed approach to recommendations
– based on his 1993 paper; references at the end
Very simple method, dead easy to implement
– seems to work pretty well
Inspiration
Recommenders are usually designed as prediction of ratings
– Dunning believes this is the wrong approach
– people's ratings don't necessarily reflect what they'll buy
– go by what people do rather than what they say
You don't want to recommend Bob Dylan
– everyone's already heard of him, and knows what they think
– you want to recommend things that are new to the user
You don't want to recommend things everyone likes
– most likely they already know about them
Thoughts on recommendations
Step 1
– work out which things tend to occur together
– that is, if you buy this, you're likely to also buy that
– however, we only want pairs which are statistically significant
Step 2
– index up the significant pairs in a search engine
– use search to produce the actual results
The actual approach
Statistically significant co-occurrence
Part the first
User  Item
u1    i1
u1    i2
u2    i1
u3    i2
u3    i3
u3    i4
...   ...
The starting point
Some kind of log of user actions
User has
– bought a movie | album | book | ...
– opened a document
– ...
From this raw material, we can work out what things tend to go together
– and whether this is significant
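The pair-counting step can be sketched like this. This is a minimal, unoptimized illustration (not the author's llr.py), using the sample log above:

```python
from collections import defaultdict
from itertools import combinations

def cooccurrences(events):
    """Count, for each pair of items, how many users touched both."""
    items_by_user = defaultdict(set)
    for user, item in events:
        items_by_user[user].add(item)
    counts = defaultdict(int)
    for items in items_by_user.values():
        # every pair of items seen by the same user co-occurs once
        for a, b in combinations(sorted(items), 2):
            counts[(a, b)] += 1
    return dict(counts)

# the sample log from the table above
events = [("u1", "i1"), ("u1", "i2"), ("u2", "i1"),
          ("u3", "i2"), ("u3", "i3"), ("u3", "i4")]
pairs = cooccurrences(events)
```

With this data, u1 contributes the pair (i1, i2) and u3 contributes (i2, i3), (i2, i4), and (i3, i4); u2 only saw one item, so contributes nothing.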
      i1    i2    i3    i4    i5    i6    i7
i1     -    23    42     0     0     5     7
i2    23     -     6     1   129     2    10
i3    42     6     -     3     0   492     1
i4     0     1     3     -     2     3     1
i5     0   129     0     2     -    94     2
i6     5     2   492     3    94     -     1
i7     7    10     1     1     2     1     -
(the diagonal is left blank; an item's co-occurrence with itself is not used)
Item-to-item matrix
How to compute the 2x2 k matrix for a given cell in the matrix on the previous slide:
k[0][0] = the number in that cell of the matrix
k[0][1] = the sum of that cell's whole column, minus k[0][0]
k[1][0] = the sum of that cell's whole row, minus k[0][0]
k[1][1] = the sum of the entire matrix, minus k[0][0], minus k[1][0], minus k[0][1]
If the output of LLR(k) is above some threshold, the pair is considered significant.
Producing the k 2x2 matrix
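The LLR score itself can be computed in a few lines of Python. This sketch follows the entropy formulation in Dunning's blog post (referenced at the end), with k11 = k[0][0], k12 = k[0][1], and so on; it is an illustration, not the author's llr.py:

```python
import math

def xlogx(x):
    # x * log(x), with the conventional value 0 at x = 0
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    # unnormalized Shannon entropy of a list of counts
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    # log-likelihood ratio for the 2x2 contingency table k
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    return 2.0 * (row + col - entropy(k11, k12, k21, k22))
```

A strongly associated pair scores high: llr(10, 0, 0, 10) is about 27.7. A pair that co-occurs exactly as often as chance predicts scores zero: llr(5, 5, 5, 5) is 0.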
Check the Python code on
– https://github.com/larsga/py-snippets/tree/master/machine-learning/llr
– this requires a lot of memory and CPU
Or just use Mahout
– RowSimilarityJob does exactly this
Doing it for real
Search engine as recommender
Part the second
Take all the items and index them up with the search engine in the usual way
– that is, each item has an id, a title, a description, etc
Then, add a "magic" field
– put into it the IDs of all the items that appear in a significant pair with this item
– let's call this field "indicators"
Now we're ready to do recommendations
Indexing with the search engine
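The shape of the indexed documents can be sketched like this. This is not the author's llr_index.py: it uses plain dicts instead of real Lucene documents, and the item titles are made up for illustration:

```python
from collections import defaultdict

def build_documents(items, significant_pairs):
    # for each item, collect the ids of all items it forms a
    # statistically significant pair with
    indicators = defaultdict(set)
    for a, b in significant_pairs:
        indicators[a].add(b)
        indicators[b].add(a)
    # one "document" per item, with the magic indicators field
    return [{"id": item_id, "title": title,
             "indicators": " ".join(sorted(indicators[item_id]))}
            for item_id, title in items.items()]

# illustrative data only
items = {"i1": "The Godfather", "i2": "Goodfellas", "i3": "The Daytrippers"}
docs = build_documents(items, {("i1", "i2"), ("i2", "i3")})
```

In a real index each dict field becomes a Lucene field, with "indicators" analyzed as whitespace-separated tokens so each id is searchable.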
Collect some set of items for which the user has expressed a preference
– by buying them, looking at them, rating them, whatever
The IDs of these items are your query
– search the "indicators" field
– the search results are your recommendations
That's it!
– pack up, go home
Doing recommendations
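A toy version of that query step, carrying over the hypothetical document shape from the indexing sketch. Scoring here is plain overlap counting, where a real search engine would apply TF/IDF weighting:

```python
def recommend(docs, liked_ids, limit=5):
    # score every document by how many liked item ids occur in its
    # "indicators" field, then return the best-scoring item ids
    liked = set(liked_ids)
    scored = []
    for doc in docs:
        if doc["id"] in liked:
            continue  # never recommend something the user already has
        hits = len(set(doc["indicators"].split()) & liked)
        if hits:
            scored.append((hits, doc["id"]))
    scored.sort(reverse=True)
    return [item_id for _, item_id in scored[:limit]]

# illustrative index: i2 is significantly paired with both i1 and i3
docs = [{"id": "i1", "indicators": "i2"},
        {"id": "i2", "indicators": "i1 i3"},
        {"id": "i3", "indicators": "i2"},
        {"id": "i4", "indicators": ""}]
```

Liking i1 recommends i2 (the only document with i1 among its indicators); liking both i1 and i3 still recommends i2, now with twice the score.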
Imagine that you're searching for movies, and you type "the godfather"
– "the" appears in all documents, so documents matching that get a low relevance score
– "godfather" appears in very few documents, so matches on that get a high score
– this is basically TF/IDF in a nutshell
Now, imagine you liked two movies: "The Godfather" and "The Daytrippers"
– nearly all movies have "The Godfather" as an indicator
– very few have "The Daytrippers"
– the second will therefore influence recommendations much more
Why does it work?
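The effect can be made concrete with the classic IDF formula. This is a simplification (Lucene's actual scoring formula differs in its details), and the document counts below are invented for illustration:

```python
import math

def idf(n_docs, doc_freq):
    # inverse document frequency: the rarer the term, the higher the weight
    return math.log(n_docs / doc_freq)

# 10,000 movies: suppose "The Godfather" appears as an indicator on
# 5,000 of them, "The Daytrippers" on only 20 (invented numbers)
common = idf(10000, 5000)   # low weight for the ubiquitous indicator
rare = idf(10000, 20)       # roughly nine times higher
```

So a match on the rare indicator contributes far more to the relevance score, which is exactly why the unusual movie dominates the recommendations.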
Trying it out for real
Part the third
Again, the code is on Github
– very simple webapp based on web.py and Lucene
– https://github.com/larsga/py-snippets/tree/master/machine-learning/llr
The underlying data is the MovieLens dataset
– 10 million ratings of 10,000 movies by 72,000 users
– http://grouplens.org/datasets/movielens/
Real demo with real data
llr.py
– chews through the data, producing the significant pairs
– takes a huge amount of memory and about 30 minutes
– absolutely no attempt has been made to optimize it
llr_index.py
– reads the output of the previous script, builds a Lucene index
recom-ui.py
– the actual web application
Three scripts
Liked one movie
Liked two movies
Movies with the highest LLR score together with this movie
Liked three movies
Recommendations are actually now spot-on. At least for me.
class Movie:
    def GET(self, movieid):
        nocache()
        # look up the movie document itself
        doc = search.do_query('id', movieid)[0]
        # fetch the movies that form significant pairs with this one
        recoms = [search.do_query('id', iid)[0] for iid in doc.bets]
        if hasattr(session, 'liked'):
            # recommendations based on everything the user has liked
            youlike = search.do_query('indicators', session.liked)
        else:
            youlike = []
        return render.movie(doc, recoms, youlike)
Complete code for movie page
Further work
Winding up
Tweak the parameters a bit to see what happens
Can we support a “Dislike” button?
Test it with more kinds of data
Learn how to do this with Mahout
Things left to do
What is this?
From Ted Dunning’s slides
And this?
From Ted Dunning’s slides
And this?
From Ted Dunning’s slides
The original 1993 paper
– http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.5962
Ebook with lots of background but little detail
– http://www.mapr.com/practical-machine-learning
Slides covering the same material
– www.slideshare.net/tdunning/building-multimodal-recommendation-engines-using-search-engines
Blog post with actual equations
– http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html
References