LA HUG Dec 2011 - Recommendation Talk

Dec 2011 – LA HUG – Santa Monica, CA Mahout, CDH3, and Recommendation Josh Patterson | Sr Solution Architect

description

Have you ever been recommended a friend on Facebook? Or an item you might be interested in on Amazon? If so then you’ve benefitted from the value of recommendation systems. Recommendation systems apply knowledge discovery techniques to the problem of making recommendations that are personalized for each user. Recommendation systems are one way we can use algorithms to help us sort through the masses of information to find the “good stuff” in a very personalized way.

Transcript of LA HUG Dec 2011 - Recommendation Talk

Page 1: LA HUG Dec 2011 - Recommendation Talk

Dec 2011 – LA HUG – Santa Monica, CA

Mahout, CDH3, and Recommendation

Josh Patterson | Sr Solution Architect

Page 2: LA HUG Dec 2011 - Recommendation Talk

Who is Josh Patterson?

[email protected] – Twitter: @jpatanooga

• Master’s Thesis: self-organizing mesh networks
  – Published in IAAI-09: TinyTermite: A Secure Routing Algorithm
• Conceived, built, and led Hadoop integration for openPDC project at Tennessee Valley Authority (TVA)
  – Led team which designed classification techniques for time series and MapReduce
• Open source work at
  – http://openpdc.codeplex.com
  – https://github.com/jpatanooga
• Today
  – Sr. Solutions Architect at Cloudera

Page 3: LA HUG Dec 2011 - Recommendation Talk

Outline

• Intro to Recommendation
• Recommendation with Mahout and CDH3


Page 4: LA HUG Dec 2011 - Recommendation Talk

Recommendation

“I know I've made some very poor decisions recently, but I can give you my complete assurance that my work will be back to normal. I've still got the greatest enthusiasm and confidence in the mission. And I want to help you. ”

--- HAL from “2001: A Space Odyssey”


Page 5: LA HUG Dec 2011 - Recommendation Talk

Information Explosion

• Amount of data, articles, shows exploding
  – Hard to know what to pay attention to
  – Be nice if it was personalized to my own tastes
• Issues at scale
  – Heap size limits become an issue with a large number of preferences
    • > 1 billion preferences
  – “Real time” recommenders have issues with scale as well

Copyright 2010 Cloudera Inc. All rights reserved

Page 6: LA HUG Dec 2011 - Recommendation Talk

User-based recommendations

• Look for users who share the same rating patterns as the active user
  – Look at the notion of similarity between users based on their preferences/actions/ratings
• So we can recommend the same things to similar users
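
A minimal sketch of the user-based idea, using the small ratings sample from the "Simple Recommender Input" slide of this talk. The similarity measure (cosine over co-rated items) and the function names are assumptions chosen for illustration; the talk does not prescribe a particular measure here.

```python
from math import sqrt

# user -> {item: rating}, taken from the sample input slide of this talk.
ratings = {
    10: {1000: 5.0, 1001: 3.0, 1004: 2.5},
    13: {1001: 3.5, 1002: 4.5, 1003: 1.0, 1004: 3.5},
    15: {1000: 4.5, 1001: 3.5, 1002: 2.5},
}

def cosine_similarity(a, b):
    """Cosine similarity computed only over items both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)

def recommend_from_nearest_user(active, ratings):
    """Find the most similar other user and return their items the active user lacks."""
    others = [u for u in ratings if u != active]
    nearest = max(others, key=lambda u: cosine_similarity(ratings[active], ratings[u]))
    unseen = set(ratings[nearest]) - set(ratings[active])
    return nearest, sorted(unseen)

print(recommend_from_nearest_user(10, ratings))  # -> (13, [1002, 1003])
```

With only a handful of co-rated items the similarity scores are close together, so a real system would want more overlap (or a different measure) before trusting the neighbor.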

Page 7: LA HUG Dec 2011 - Recommendation Talk

Item-based recommendations

• Item-based recommenders are derived from how similar items are to other items
  – Users who bought X also bought Y
• Compute similarity matrix between items

Page 8: LA HUG Dec 2011 - Recommendation Talk

Item vs User Based

• Algorithms are similar
  – But not entirely symmetric
• Item based
  – Running time scales up as the number of items increases
    • If the number of items is relatively low compared to the number of users, performance could be better
  – Items tend to change less than users
• User based
  – Running time goes up as the number of users increases


Page 9: LA HUG Dec 2011 - Recommendation Talk

Recommendation in Mahout

• Not a single recommender engine
  – Assortment of components
• Components can be plugged together and customized
  – We target a specific domain with a custom-built recommender
  – Need experimentation to get good results


Page 10: LA HUG Dec 2011 - Recommendation Talk

Co-Occurrence Matrix

• Example:
  – If we have 10 users, and all of them express a preference for items A and B
    • A and B are said to co-occur 10 times
• Can be thought of much like similarity
  – The more we see two items occur together
  – The greater the chance the two items are related somehow
• Producing a co-occurrence matrix ends up being a simple exercise in counting
  – We compute the number of times each pair occurs together, user by user
  – Works well distributed


Page 11: LA HUG Dec 2011 - Recommendation Talk

Simple Recommender Input

UserID, ItemID, Rating

10, 1000, 5.0

10, 1001, 3.0

10, 1004, 2.5

13, 1001, 3.5

13, 1002, 4.5

13, 1003, 1.0

13, 1004, 3.5

15, 1000, 4.5

15, 1001, 3.5

15, 1002, 2.5


Page 12: LA HUG Dec 2011 - Recommendation Talk

Simple Co-Occurrence Matrix


        1000  1001  1002  1003  1004
1000       2     2     1     0     1
1001       2     3     2     1     2
1002       1     2     2     1     1
1003       0     1     1     1     1
1004       1     2     1     1     2
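
The counting behind this matrix can be sketched in a few lines. This is a single-machine illustration under stated assumptions, not Mahout's distributed implementation; the names `PREFS` and `cooccurrence` are hypothetical. The diagonal holds the number of users who rated each item, which matches the matrix on the slide.

```python
from collections import defaultdict
from itertools import product

# (userID, itemID, rating) triples from the "Simple Recommender Input" slide.
PREFS = [
    (10, 1000, 5.0), (10, 1001, 3.0), (10, 1004, 2.5),
    (13, 1001, 3.5), (13, 1002, 4.5), (13, 1003, 1.0), (13, 1004, 3.5),
    (15, 1000, 4.5), (15, 1001, 3.5), (15, 1002, 2.5),
]

def cooccurrence(prefs):
    """Count, for each ordered item pair, how many users rated both items."""
    items_by_user = defaultdict(set)
    for user, item, _rating in prefs:
        items_by_user[user].add(item)
    items = sorted({item for _user, item, _rating in prefs})
    counts = {pair: 0 for pair in product(items, repeat=2)}
    # Each user contributes one count to every pair of items they rated.
    for user_items in items_by_user.values():
        for pair in product(user_items, repeat=2):
            counts[pair] += 1
    return items, counts

items, counts = cooccurrence(PREFS)
matrix = [[counts[(a, b)] for b in items] for a in items]
for item, row in zip(items, matrix):
    print(item, row)
```

Because each user's pairs can be counted independently and summed afterwards, the same logic splits naturally across machines.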

Page 13: LA HUG Dec 2011 - Recommendation Talk

User’s Preferences as a Vector

• In other recommendation algos we look at users as points in space
  – Euclidean distances as similarity
• In a data model with n items, user preferences are like a vector over n dimensions
  – With 1 dimension for each item
  – Creates sparse vector
• Example
  – User 10: { 5.0, 3.0, 0.0, 0.0, 2.5 }
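
As a rough illustration of the vector view, the sketch below builds user 10's vector from the sample input. The item ordering (`ITEMS`) and the helper name `user_vector` are assumptions made for the example; unrated items are filled with 0.0 as on the slide.

```python
# Items in a fixed order; each position is one dimension of the vector.
ITEMS = [1000, 1001, 1002, 1003, 1004]

# (userID, itemID, rating) triples for user 10 from the sample input slide.
PREFS = [(10, 1000, 5.0), (10, 1001, 3.0), (10, 1004, 2.5)]

def user_vector(user_id, prefs, items):
    """Return a dense preference vector, with 0.0 for items the user has not rated."""
    rated = {item: rating for user, item, rating in prefs if user == user_id}
    return [rated.get(item, 0.0) for item in items]

print(user_vector(10, PREFS, ITEMS))  # -> [5.0, 3.0, 0.0, 0.0, 2.5]
```

With a large catalog most entries are zero, so in practice only the rated items would be stored (the `rated` dictionary above is effectively that sparse form).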


Page 14: LA HUG Dec 2011 - Recommendation Talk

Computing Recommendations

• Multiply the co-occurrence matrix by the user vector (as a column vector)
  – Each item row vector x the user column vector
• Result: a vector whose dimension is equal to the number of items
  – The highest values in the result vector R are the “best recommendations”


Page 15: LA HUG Dec 2011 - Recommendation Talk

Calculating R: Example


Co-occurrence matrix x user vector = R

        1000  1001  1002  1003  1004       User vector         R
1000       2     2     1     0     1           5.0            18.5
1001       2     3     2     1     2           3.0            24
1002       1     2     2     1     1     x     0.0      =     13.5
1003       0     1     1     1     1           0.0             5.5
1004       1     2     1     1     2           2.5            16

R value for item 1002:

1 ( 5.0 ) + 2 ( 3.0 ) + 2 ( 0.0 ) + 1 ( 0.0 ) + 1 ( 2.5 ) = 13.5
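
The multiplication above can be checked with a short sketch. The matrix and vector are copied from the slides; the helper name `multiply` is hypothetical, and plain Python lists are used instead of a linear algebra library to keep the example self-contained.

```python
# Co-occurrence matrix, rows and columns ordered as items 1000..1004.
C = [
    [2, 2, 1, 0, 1],
    [2, 3, 2, 1, 2],
    [1, 2, 2, 1, 1],
    [0, 1, 1, 1, 1],
    [1, 2, 1, 1, 2],
]

# User 10's preference vector over the same item order.
u = [5.0, 3.0, 0.0, 0.0, 2.5]

def multiply(matrix, vector):
    """Matrix-vector product: one dot product per item row."""
    return [sum(c * v for c, v in zip(row, vector)) for row in matrix]

R = multiply(C, u)
print(R)  # -> [18.5, 24.0, 13.5, 5.5, 16.0]
```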

Page 16: LA HUG Dec 2011 - Recommendation Talk

Recommendations

• If a user has already indicated a preference for an item, we don’t want to recommend it

• We take the remaining items ranked by their R value
  – Here it would be 1002 at 13.5
  – Followed by 1003 at 5.5

User 10's existing preferences:

10, 1000, 5.0
10, 1001, 3.0
10, 1004, 2.5

R (items 1000 through 1004): 18.5, 24, 13.5, 5.5, 16
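
A small sketch of this filtering and ranking step, using the R values and user 10's rated items from the slides. The function name `top_recommendations` and its `n` parameter are assumptions for illustration.

```python
ITEMS = [1000, 1001, 1002, 1003, 1004]
R = [18.5, 24.0, 13.5, 5.5, 16.0]   # result vector for user 10
already_rated = {1000, 1001, 1004}  # items user 10 has expressed a preference for

def top_recommendations(items, scores, seen, n=2):
    """Drop items the user already rated, then rank the rest by score, highest first."""
    candidates = [(item, score) for item, score in zip(items, scores) if item not in seen]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:n]

print(top_recommendations(ITEMS, R, already_rated))  # -> [(1002, 13.5), (1003, 5.5)]
```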

Page 17: LA HUG Dec 2011 - Recommendation Talk

Recommendations with Mahout and CDH3u2

“Dave Bowman: I don't know; I think so. You know of course though he's right about the 9000 series having a perfect operational record. They do.

Dr. Frank Poole: Unfortunately that sounds a little like famous last words.”

--- “2001:A Space Odyssey”


Page 18: LA HUG Dec 2011 - Recommendation Talk

Step 1: Install CDH3u2

• Set up CDH3u2
  – https://ccp.cloudera.com/display/CDHDOC/CDH3+Quick+Start+Guide
  – Set up in pseudo-distributed mode for this demo if you don’t have a cluster


Page 19: LA HUG Dec 2011 - Recommendation Talk

Step 2: Install Mahout

• Set up Apache Mahout with CDH3
  – https://ccp.cloudera.com/display/CDHDOC/Mahout+Installation
  – Make sure $JAVA_HOME is set or Mahout will complain


Page 20: LA HUG Dec 2011 - Recommendation Talk

Step 3: Get GroupLens Data

• Download
  – http://www.grouplens.org/system/files/ml-1m.zip
• Format
  – UserID::MovieID::Rating::Timestamp
• where
  – UserIDs are integers
  – MovieIDs are integers
  – Ratings are 1 through 5 “stars” (integers)
  – Timestamp is seconds since the epoch
• Each user has at least 20 ratings
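
To make the record layout concrete, here is a sketch of parsing one line in this format. The `Rating` type, the `parse_rating` helper, and the sample line are illustrative assumptions; the timestamp is interpreted as UTC seconds since the epoch.

```python
from datetime import datetime, timezone
from typing import NamedTuple

class Rating(NamedTuple):
    user_id: int
    movie_id: int
    stars: int
    rated_at: datetime

def parse_rating(line):
    """Parse one 'UserID::MovieID::Rating::Timestamp' line into a Rating."""
    user_id, movie_id, stars, timestamp = line.strip().split("::")
    stars = int(stars)
    if not 1 <= stars <= 5:
        raise ValueError(f"rating out of range: {stars}")
    rated_at = datetime.fromtimestamp(int(timestamp), tz=timezone.utc)
    return Rating(int(user_id), int(movie_id), stars, rated_at)

# Illustrative line in the documented format.
sample = parse_rating("1::1193::5::978300760")
print(sample.user_id, sample.movie_id, sample.stars, sample.rated_at.isoformat())
```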


Page 21: LA HUG Dec 2011 - Recommendation Talk

Step 4: Prep Data

• This file isn’t in exactly the format Mahout prefers, but this is an easy fix
  – Mahout is looking for a CSV file with lines of the form:
    • userID, itemID, value
• From bash run
  – tr -s ':' ',' < ratings.dat | cut -f1-3 -d, > ratings.csv
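
For readers who prefer not to use the shell pipeline, the same conversion can be sketched in Python. The function names are hypothetical and the file names simply mirror the ones in the command above; this is an equivalent under those assumptions, not part of the original tutorial.

```python
def to_mahout_csv(lines):
    """Yield 'userID,itemID,value' rows from 'UserID::MovieID::Rating::Timestamp' lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        user_id, movie_id, rating = line.split("::")[:3]  # drop the timestamp
        yield f"{user_id},{movie_id},{rating}"

def convert_file(src="ratings.dat", dst="ratings.csv"):
    """Stream the source file line by line so the whole file is never held in memory."""
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for row in to_mahout_csv(fin):
            fout.write(row + "\n")

print(list(to_mahout_csv(["1::1193::5::978300760\n"])))  # -> ['1,1193,5']
```

Calling `convert_file()` in the directory containing `ratings.dat` would write `ratings.csv` next to it.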


Page 22: LA HUG Dec 2011 - Recommendation Talk

Step 5: Generate Recommendations

• Input to this job is going to be the “ratings.csv” file we generated, in the format:
  – userID, itemID, value
• We also want to give it a list of userIDs to generate recommendations for
• Output of the recommendation job will be another CSV file with the layout of:
  – userID [ itemID, score, ... ]
  – Represents the userIDs with their recommended itemIDs along with the preference scores


Page 23: LA HUG Dec 2011 - Recommendation Talk

Step 5: Command Line

• Put ratings file in HDFS
  – hadoop fs -put ratings.csv [input-hdfs-path]
• Put user file in HDFS
  – Let’s put “6040” on a single line in a file and put that in HDFS
    • hadoop fs -put [my_local_file] [user_file_location_in_hdfs]
• Now we can run the recommender job
  – mahout recommenditembased --input [input-hdfs-path] --output [output-hdfs-path] --tempDir [tmp-hdfs-path] --usersFile [user_file_location_in_hdfs]
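
If the job is launched from a script, the command line can be assembled programmatically. The sketch below only builds and prints the argument list; it uses the flags shown on this slide, while the function name and the example HDFS paths are hypothetical placeholders.

```python
import shlex

def recommender_command(input_path, output_path, temp_dir, users_file):
    """Build the argument list for the item-based recommender job shown on this slide."""
    return [
        "mahout", "recommenditembased",
        "--input", input_path,
        "--output", output_path,
        "--tempDir", temp_dir,
        "--usersFile", users_file,
    ]

# Hypothetical HDFS paths, for illustration only.
cmd = recommender_command("ratings/input", "ratings/output", "ratings/tmp", "ratings/users.txt")
print(shlex.join(cmd))
# To actually launch the job: subprocess.run(cmd, check=True)
```

Passing a list rather than a single string avoids shell quoting problems if a path ever contains spaces.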


Page 24: LA HUG Dec 2011 - Recommendation Talk

Take a Look at the Results

• Cat output of job
  – hadoop fs -cat [output-hdfs-path]/part-r-00000
• Which should look like:
  – 6040 [1941:5.0,1904:5.0,2859:5.0,3811:5.0,3814:5.0,14:5.0,17:5.0,3795:5.0,3794:5.0,3793:5.0]
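
A line of this output can be turned back into structured data with a short parser. This sketch assumes the user ID and the bracketed list are separated by whitespace, as displayed on the slide; the function name `parse_recommendations` is hypothetical.

```python
def parse_recommendations(line):
    """Parse 'userID [item:score,item:score,...]' into (userID, [(item, score), ...])."""
    user_id, items = line.strip().split(None, 1)
    body = items.strip().strip("[]")
    if not body:
        return int(user_id), []  # user with no recommendations
    pairs = (entry.split(":") for entry in body.split(","))
    return int(user_id), [(int(item), float(score)) for item, score in pairs]

line = "6040 [1941:5.0,1904:5.0,2859:5.0,3811:5.0,3814:5.0,14:5.0,17:5.0,3795:5.0,3794:5.0,3793:5.0]"
user, recs = parse_recommendations(line)
print(user, recs[:3])  # -> 6040 [(1941, 5.0), (1904, 5.0), (2859, 5.0)]
```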


Page 25: LA HUG Dec 2011 - Recommendation Talk

Questions? (Thank You!)

• Recommendation Tutorial based on:
  – http://www.cloudera.com/blog/2011/11/recommendation-with-apache-mahout-in-cdh3/
• Cloudera’s Distribution including Apache Hadoop (CDH):
  – http://www.cloudera.com
• Apache Mahout
  – http://mahout.apache.org


Page 26: LA HUG Dec 2011 - Recommendation Talk

More?

• Look at www.cloudera.com/training to learn more about Hadoop
• Read www.cloudera.com/blog
  • Lots of great use cases.
• Check out the downloads page at
  • www.cloudera.com/downloads
  • Get your own copy of Cloudera Distribution for Apache Hadoop (CDH)
  • Grab Demo VMs, Connectors, other useful tools.
• Contact Josh with any questions at
  • [email protected]


Page 27: LA HUG Dec 2011 - Recommendation Talk

References

• S. Owen, R. Anil, T. Dunning, E. Friedman: Mahout in Action

• Sarwar et al.: Item-Based Collaborative Filtering Recommendation Algorithms

• Apache Mahout Wiki:
  – http://mahout.apache.org/


Page 28: LA HUG Dec 2011 - Recommendation Talk

Workflow

• Job 1
  – Preprocess data if needed
• Job 2
  – Create User Vectors
• Job 3
  – Count Users
• Job 4
  – Prune and Transpose
• Job 5
  – RowSimilarityJob
    • Weights
    • pairwiseSimilarity
    • asMatrix
• Job 6
  – Pre Partial Multiply 1
• Job 7
  – Pre Partial Multiply 2
• Job 8
  – Partial Multiply
• Job 9


Page 29: LA HUG Dec 2011 - Recommendation Talk

Temp Files Generated

• countUsers
• itemIDIndex
• itemUserMatrix
• pairwiseSimilarity
• partialMultiply
• partialMultiply1
• partialMultiply2
• similarityMatrix
• userVectors
• weights
