Using the search engine as recommendation engine
Recommendations from the search engine
Sesam Hackathon, Warsaw, 2014-03-23
Lars Marius Garshol, [email protected], http://twitter.com/larsga
This whole presentation is about Ted Dunning's proposed approach to recommendations
– based on his 1993 paper; references at the end
Very simple method, dead easy to implement
– seems to work pretty well
Inspiration
Recommenders are usually designed as prediction of ratings
– Dunning believes this is the wrong approach
– people's ratings don't necessarily reflect what they'll buy
– go by what people do rather than what they say
You don't want to recommend Bob Dylan
– everyone's already heard of him, and knows what they think
– you want to recommend things that are new to the user
You don't want to recommend things everyone likes
– most likely they already know about them
Thoughts on recommendations
Step 1
– work out which things tend to occur together
– that is, if you buy this, you're likely to also buy that
– however, we only want pairs which are statistically significant
Step 2
– index up the significant pairs in a search engine
– use search to produce the actual results
The actual approach
Statistically significant co-occurrence
Part the first
User  Item
u1    i1
u1    i2
u2    i1
u3    i2
u3    i3
u3    i4
...   ...
The starting point
Some kind of log of user actions
User has
– bought a movie | album | book | ...
– opened a document
– ...
From this raw material, we can work out what things tend to go together
– and whether this is significant
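The pair-counting step can be sketched like this. This is a minimal, unoptimized illustration (not the author's llr.py), using the sample log above:

```python
from collections import defaultdict
from itertools import combinations

def cooccurrences(events):
    """Count, for each pair of items, how many users touched both."""
    items_by_user = defaultdict(set)
    for user, item in events:
        items_by_user[user].add(item)
    counts = defaultdict(int)
    for items in items_by_user.values():
        # every pair of items seen by the same user co-occurs once
        for a, b in combinations(sorted(items), 2):
            counts[(a, b)] += 1
    return dict(counts)

# the sample log from the table above
events = [("u1", "i1"), ("u1", "i2"), ("u2", "i1"),
          ("u3", "i2"), ("u3", "i3"), ("u3", "i4")]
pairs = cooccurrences(events)
```

With this data, u1 contributes the pair (i1, i2) and u3 contributes (i2, i3), (i2, i4), and (i3, i4); u2 only saw one item, so contributes nothing.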
      i1    i2    i3    i4    i5    i6    i7
i1     -    23    42     0     0     5     7
i2    23     -     6     1   129     2    10
i3    42     6     -     3     0   492     1
i4     0     1     3     -     2     3     1
i5     0   129     0     2     -    94     2
i6     5     2   492     3    94     -     1
i7     7    10     1     1     2     1     -
(the diagonal is left blank; an item's co-occurrence with itself is not used)
Item-to-item matrix
How to compute the 2x2 k matrix for a given cell in the matrix on the previous slide:
k[0][0] = the number in that cell of the matrix
k[0][1] = the sum of that cell's whole column, minus k[0][0]
k[1][0] = the sum of that cell's whole row, minus k[0][0]
k[1][1] = the sum of the entire matrix, minus k[0][0], minus k[1][0], minus k[0][1]
If the output of LLR(k) is above some threshold, the pair is considered significant.
Producing the k 2x2 matrix
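The LLR score itself can be computed in a few lines of Python. This sketch follows the entropy formulation in Dunning's blog post (referenced at the end), with k11 = k[0][0], k12 = k[0][1], and so on; it is an illustration, not the author's llr.py:

```python
import math

def xlogx(x):
    # x * log(x), with the conventional value 0 at x = 0
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    # unnormalized Shannon entropy of a list of counts
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    # log-likelihood ratio for the 2x2 contingency table k
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    return 2.0 * (row + col - entropy(k11, k12, k21, k22))
```

A strongly associated pair scores high: llr(10, 0, 0, 10) is about 27.7. A pair that co-occurs exactly as often as chance predicts scores zero: llr(5, 5, 5, 5) is 0.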
Check the Python code on
– https://github.com/larsga/py-snippets/tree/master/machine-learning/llr
– this requires a lot of memory and CPU
Or just use Mahout
– RowSimilarityJob does exactly this
Doing it for real
Search engine as recommender
Part the second
Take all the items and index them up with the search engine in the usual way
– that is, each item has an id, a title, a description, etc
Then, add a "magic" field
– put into it the IDs of all the items that appear in a significant pair with this item
– let's call this field "indicators"
Now we're ready to do recommendations
Indexing with the search engine
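The shape of the indexed documents can be sketched like this. This is not the author's llr_index.py: it uses plain dicts instead of real Lucene documents, and the item titles are made up for illustration:

```python
from collections import defaultdict

def build_documents(items, significant_pairs):
    # for each item, collect the ids of all items it forms a
    # statistically significant pair with
    indicators = defaultdict(set)
    for a, b in significant_pairs:
        indicators[a].add(b)
        indicators[b].add(a)
    # one "document" per item, with the magic indicators field
    return [{"id": item_id, "title": title,
             "indicators": " ".join(sorted(indicators[item_id]))}
            for item_id, title in items.items()]

# illustrative data only
items = {"i1": "The Godfather", "i2": "Goodfellas", "i3": "The Daytrippers"}
docs = build_documents(items, {("i1", "i2"), ("i2", "i3")})
```

In a real index each dict field becomes a Lucene field, with "indicators" analyzed as whitespace-separated tokens so each id is searchable.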
Collect some set of items for which the user has expressed a preference
– by buying them, looking at them, rating them, whatever
The IDs of these items are your query
– search the "indicators" field
– the search results are your recommendations
That's it!
– pack up, go home
Doing recommendations
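A toy version of that query step, carrying over the hypothetical document shape from the indexing sketch. Scoring here is plain overlap counting, where a real search engine would apply TF/IDF weighting:

```python
def recommend(docs, liked_ids, limit=5):
    # score every document by how many liked item ids occur in its
    # "indicators" field, then return the best-scoring item ids
    liked = set(liked_ids)
    scored = []
    for doc in docs:
        if doc["id"] in liked:
            continue  # never recommend something the user already has
        hits = len(set(doc["indicators"].split()) & liked)
        if hits:
            scored.append((hits, doc["id"]))
    scored.sort(reverse=True)
    return [item_id for _, item_id in scored[:limit]]

# illustrative index: i2 is significantly paired with both i1 and i3
docs = [{"id": "i1", "indicators": "i2"},
        {"id": "i2", "indicators": "i1 i3"},
        {"id": "i3", "indicators": "i2"},
        {"id": "i4", "indicators": ""}]
```

Liking i1 recommends i2 (the only document with i1 among its indicators); liking both i1 and i3 still recommends i2, now with twice the score.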
Imagine that you're searching for movies, and you type "the godfather"
– "the" appears in all documents, so documents matching that get a low relevance score
– "godfather" appears in very few documents, so matches on that get a high score
– this is basically TF/IDF in a nutshell
Now, imagine you liked two movies: "The Godfather" and "The Daytrippers"
– nearly all movies have "The Godfather" as an indicator
– very few have "The Daytrippers"
– the second will therefore influence recommendations much more
Why does it work?
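The effect can be made concrete with the classic IDF formula. This is a simplification (Lucene's actual scoring formula differs in its details), and the document counts below are invented for illustration:

```python
import math

def idf(n_docs, doc_freq):
    # inverse document frequency: the rarer the term, the higher the weight
    return math.log(n_docs / doc_freq)

# 10,000 movies: suppose "The Godfather" appears as an indicator on
# 5,000 of them, "The Daytrippers" on only 20 (invented numbers)
common = idf(10000, 5000)   # low weight for the ubiquitous indicator
rare = idf(10000, 20)       # roughly nine times higher
```

So a match on the rare indicator contributes far more to the relevance score, which is exactly why the unusual movie dominates the recommendations.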
Trying it out for real
Part the third
Again, the code is on Github
– very simple webapp based on web.py and Lucene
– https://github.com/larsga/py-snippets/tree/master/machine-learning/llr
The underlying data is the MovieLens dataset
– 10 million ratings of 10,000 movies by 72,000 users
– http://grouplens.org/datasets/movielens/
Real demo with real data
llr.py
– chews through the data, producing the significant pairs
– takes a huge amount of memory and about 30 minutes
– absolutely no attempt has been made to optimize it
llr_index.py
– reads the output of the previous script, builds a Lucene index
recom-ui.py
– the actual web application
Three scripts
Liked one movie
Liked two movies
Movies with the highest LLR score together with this movie
Liked three movies
Recommendations are actually now spot-on. At least for me.
class Movie:
    def GET(self, movieid):
        nocache()
        # look up the movie document itself
        doc = search.do_query('id', movieid)[0]
        # fetch the movies that form significant pairs with this one
        recoms = [search.do_query('id', iid)[0] for iid in doc.bets]
        if hasattr(session, 'liked'):
            # recommendations based on everything the user has liked
            youlike = search.do_query('indicators', session.liked)
        else:
            youlike = []
        return render.movie(doc, recoms, youlike)
Complete code for movie page
Further work
Winding up
Tweak the parameters a bit to see what happens
Can we support a “Dislike” button?
Test it with more kinds of data
Learn how to do this with Mahout
Things left to do
What is this?
From Ted Dunning’s slides
And this?
From Ted Dunning’s slides
And this?
From Ted Dunning’s slides
The original 1993 paper
– http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.5962
Ebook with lots of background but little detail
– http://www.mapr.com/practical-machine-learning
Slides covering the same material
– www.slideshare.net/tdunning/building-multimodal-recommendation-engines-using-search-engines
Blog post with actual equations
– http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html
References