Mahout part1

Mahout in Action Part 1 Yasmine M. Gaber 28 February 2013

description

Part one of a presentation about the Apache Mahout system. It is based on http://my.safaribooksonline.com/9781935182689/

Transcript of Mahout part1

Page 1: Mahout part1

Mahout in Action, Part 1

Yasmine M. Gaber, 28 February 2013

Page 2: Mahout part1

Agenda

Meet Apache Mahout

Part 1: Recommendation

Part 2: Clustering

Part 3: Classification

Page 3: Mahout part1

Meet Apache Mahout

It is an open source machine learning library from Apache

It is scalable

It is a Java library

It can be used with Hadoop to process large-scale data.

Page 4: Mahout part1

Famous Engines

Recommender engines: Amazon.com, Netflix, dating sites like Líbímseti, social networking sites like Facebook

Clustering engines: Google News, search engines like Clusty

Classification engines: spam email filters, Google's Picasa, optical character recognition software, Apple's Genius feature in iTunes

Page 5: Mahout part1

Recommendations

Page 6: Mahout part1

Recommender Input

A preference consists of a user ID, an item ID, and a value expressing the user's preference for the item

The input is a .csv file
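A minimal plain-Java sketch of reading such input follows. It assumes each line has the form userID,itemID,value; the class and field names are hypothetical, and this is not Mahout's own file-based data model.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: parses preference lines assumed to look like "userID,itemID,value". */
final class PreferenceCsv {

    /** One (user, item, value) preference. */
    static final class Pref {
        final long userId;
        final long itemId;
        final float value;

        Pref(long userId, long itemId, float value) {
            this.userId = userId;
            this.itemId = itemId;
            this.value = value;
        }
    }

    /** Parses the given lines, skipping blank ones and rejecting malformed ones. */
    static List<Pref> parse(List<String> lines) {
        List<Pref> prefs = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String[] parts = trimmed.split(",");
            if (parts.length != 3) {
                throw new IllegalArgumentException("Expected userID,itemID,value but got: " + line);
            }
            prefs.add(new Pref(
                    Long.parseLong(parts[0].trim()),
                    Long.parseLong(parts[1].trim()),
                    Float.parseFloat(parts[2].trim())));
        }
        return prefs;
    }

    public static void main(String[] args) {
        // Made-up sample lines, for illustration only.
        for (Pref p : parse(Arrays.asList("1,101,5.0", "1,102,3.0", "2,101,2.0"))) {
            System.out.println("user " + p.userId + ", item " + p.itemId + ", value " + p.value);
        }
    }
}
```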

Page 7: Mahout part1

Create Recommender

Page 8: Mahout part1

Recommender Evaluation

Average absolute difference vs. root-mean-square difference between estimated and actual preference values
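The two scores can be compared with a small plain-Java sketch. It assumes the actual and estimated preference values are already paired up in two arrays; the class and method names are hypothetical and this is not Mahout's evaluator code.

```java
/** Hypothetical sketch of the two evaluation scores; lower is better for both. */
final class EvaluationMetrics {

    /** Mean of |actual - estimated| over all pairs. */
    static double averageAbsoluteDifference(double[] actual, double[] estimated) {
        check(actual, estimated);
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            sum += Math.abs(actual[i] - estimated[i]);
        }
        return sum / actual.length;
    }

    /** Square root of the mean squared difference; penalizes large misses more heavily. */
    static double rootMeanSquare(double[] actual, double[] estimated) {
        check(actual, estimated);
        double sumOfSquares = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double diff = actual[i] - estimated[i];
            sumOfSquares += diff * diff;
        }
        return Math.sqrt(sumOfSquares / actual.length);
    }

    private static void check(double[] actual, double[] estimated) {
        if (actual.length == 0 || actual.length != estimated.length) {
            throw new IllegalArgumentException("Arrays must be non-empty and of equal length");
        }
    }

    public static void main(String[] args) {
        // Made-up values, for illustration only.
        double[] actual = {3.0, 5.0, 4.0};
        double[] estimated = {2.0, 5.0, 6.0};
        System.out.println("average absolute difference: " + averageAbsoluteDifference(actual, estimated));
        System.out.println("root-mean-square: " + rootMeanSquare(actual, estimated));
    }
}
```

Because the differences are squared before averaging, the root-mean-square score is never smaller than the average absolute difference on the same data.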

Page 9: Mahout part1

Mahout RecommenderEvaluator

Page 10: Mahout part1

Precision and Recall
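As a rough illustration, precision is the share of the top recommendations that are good, and recall is the share of all good items that appear among the top recommendations. The plain-Java sketch below assumes the "good" (relevant) items are already known and that the recommendation list has no duplicates; the names are hypothetical and this is not Mahout's evaluator.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of precision and recall for one user's top-N list. */
final class PrecisionRecall {

    /** Fraction of recommended items that are relevant. */
    static double precision(List<Long> recommended, Set<Long> relevant) {
        if (recommended.isEmpty()) {
            return 0.0;
        }
        return (double) hits(recommended, relevant) / recommended.size();
    }

    /** Fraction of relevant items that were recommended. */
    static double recall(List<Long> recommended, Set<Long> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        return (double) hits(recommended, relevant) / relevant.size();
    }

    /** Counts recommended items that are also relevant. */
    private static int hits(List<Long> recommended, Set<Long> relevant) {
        int count = 0;
        for (Long item : recommended) {
            if (relevant.contains(item)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up item IDs, for illustration only.
        List<Long> recommended = Arrays.asList(101L, 102L, 103L, 104L);
        Set<Long> relevant = new HashSet<>(Arrays.asList(101L, 104L, 105L));
        System.out.println("precision: " + precision(recommended, relevant));
        System.out.println("recall: " + recall(recommended, relevant));
    }
}
```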

Page 11: Mahout part1

RecommenderIRStatsEvaluator

Page 12: Mahout part1

Representing Recommender Data

Preference object: new GenericPreference(123, 456, 3.0f), that is, user ID 123, item ID 456, preference value 3.0

Preference Array

Page 13: Mahout part1

Representing Recommender Data

Preference Array

FastByIDMap and FastIDSet

Page 14: Mahout part1

In-memory DataModels

GenericDataModel

File-based data

Refreshable components

Database-based data

Page 15: Mahout part1

Coping without preference values

Page 16: Mahout part1

Coping without preference values

Page 17: Mahout part1

User-based Recommender

The algorithm

for every item i that u has no preference for yet
  for every other user v that has a preference for i
    compute a similarity s between u and v
    incorporate v's preference for i, weighted by s, into a running average
return the top items, ranked by weighted average
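The loop above can be sketched in plain Java as follows. This is an illustration under stated assumptions, not Mahout's GenericUserBasedRecommender: preferences are held in nested maps, the similarity is passed in as a function, users with a non-positive or undefined similarity are skipped, and all names and sample values are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.ToDoubleBiFunction;

/** Hypothetical plain-Java sketch of the user-based algorithm. */
final class UserBasedSketch {

    /** Made-up sample data: user ID -> (item ID -> preference value). */
    static Map<Long, Map<Long, Double>> sample() {
        Map<Long, Double> u1 = new HashMap<>();
        u1.put(101L, 5.0);
        u1.put(102L, 3.0);
        Map<Long, Double> u2 = new HashMap<>();
        u2.put(101L, 4.0);
        u2.put(103L, 4.0);
        u2.put(104L, 2.0);
        Map<Long, Double> u3 = new HashMap<>();
        u3.put(101L, 1.0);
        u3.put(103L, 1.0);
        u3.put(104L, 5.0);
        Map<Long, Map<Long, Double>> prefs = new HashMap<>();
        prefs.put(1L, u1);
        prefs.put(2L, u2);
        prefs.put(3L, u3);
        return prefs;
    }

    /** Returns up to howMany item IDs for user u, best estimate first. */
    static List<Long> recommend(long u, Map<Long, Map<Long, Double>> prefs,
                                ToDoubleBiFunction<Long, Long> similarity, int howMany) {
        Map<Long, Double> own = prefs.get(u);
        if (own == null) {
            own = new HashMap<>();
        }
        Set<Long> allItems = new TreeSet<>();
        for (Map<Long, Double> row : prefs.values()) {
            allItems.addAll(row.keySet());
        }
        Map<Long, Double> estimates = new HashMap<>();
        for (long i : allItems) {
            if (own.containsKey(i)) {
                continue; // u already has a preference for i
            }
            double weightedSum = 0.0;
            double weightTotal = 0.0;
            for (Map.Entry<Long, Map<Long, Double>> other : prefs.entrySet()) {
                long v = other.getKey();
                Double pref = other.getValue().get(i);
                if (v == u || pref == null) {
                    continue; // only other users with a preference for i
                }
                double s = similarity.applyAsDouble(u, v);
                if (Double.isNaN(s) || s <= 0.0) {
                    continue; // assumption: ignore undefined or non-positive similarity
                }
                weightedSum += s * pref;
                weightTotal += s;
            }
            if (weightTotal > 0.0) {
                estimates.put(i, weightedSum / weightTotal);
            }
        }
        List<Long> ranked = new ArrayList<>(estimates.keySet());
        ranked.sort((a, b) -> Double.compare(estimates.get(b), estimates.get(a)));
        return new ArrayList<>(ranked.subList(0, Math.min(howMany, ranked.size())));
    }

    public static void main(String[] args) {
        // Made-up similarities: user 2 is assumed far more similar to user 1 than user 3 is.
        System.out.println(recommend(1L, sample(), (a, b) -> b == 2L ? 0.9 : 0.1, 2));
    }
}
```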

Page 18: Mahout part1

Recommender Components

Data model, implemented via DataModel

User-user similarity metric, implemented via UserSimilarity

User neighborhood definition, implemented via UserNeighborhood

Recommender engine, implemented via a Recommender (here, GenericUserBasedRecommender)

Page 19: Mahout part1

GenericUserBasedRecommender

Page 20: Mahout part1

User Neighborhoods

Fixed-size neighborhoods

Threshold-based neighborhood
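The two neighborhood definitions can be contrasted with a short plain-Java sketch. It assumes the similarities between the target user and every other user have already been computed into a map, and that the threshold is inclusive; the class, method names, and sample values are hypothetical and do not reproduce Mahout's UserNeighborhood implementations.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of fixed-size and threshold-based neighborhoods. */
final class NeighborhoodSketch {

    /** Made-up similarities of other users to the target user: user ID -> similarity. */
    static Map<Long, Double> sample() {
        Map<Long, Double> sims = new HashMap<>();
        sims.put(2L, 0.9);
        sims.put(3L, 0.2);
        sims.put(4L, 0.7);
        sims.put(5L, -0.3);
        return sims;
    }

    /** Fixed-size neighborhood: the n most similar users, most similar first. */
    static List<Long> nearestN(Map<Long, Double> sims, int n) {
        List<Long> users = new ArrayList<>(sims.keySet());
        users.sort((a, b) -> Double.compare(sims.get(b), sims.get(a)));
        return new ArrayList<>(users.subList(0, Math.min(n, users.size())));
    }

    /** Threshold-based neighborhood: every user whose similarity is at least the threshold. */
    static Set<Long> atLeast(Map<Long, Double> sims, double threshold) {
        Set<Long> users = new TreeSet<>();
        for (Map.Entry<Long, Double> e : sims.entrySet()) {
            if (e.getValue() >= threshold) {
                users.add(e.getKey());
            }
        }
        return users;
    }

    public static void main(String[] args) {
        System.out.println("nearest 2: " + nearestN(sample(), 2));
        System.out.println("similarity >= 0.5: " + atLeast(sample(), 0.5));
    }
}
```

A fixed size always yields the same number of neighbors, however dissimilar they are; a threshold yields only sufficiently similar users, but possibly very few or none.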

Page 21: Mahout part1

Similarity metrics

Pearson correlation–based similarity

It is a number between –1 and 1 that measures the tendency of two series of numbers, paired up one-to-one, to move together

Problems:

It doesn't take into account the number of items in which two users' preferences overlap, which is probably a weakness in the context of recommender engines.

If two users overlap on only one item, no correlation can be computed because of how the computation is defined
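A plain-Java sketch of the Pearson correlation over the items two users have both rated shows where the second problem comes from. The class, helper, and sample values are hypothetical, and this is not Mahout's implementation; it returns NaN when the correlation is undefined.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of Pearson correlation between two users' preferences. */
final class PearsonSketch {

    /** Convenience builder: item IDs and values at matching positions. */
    static Map<Long, Double> row(long[] items, double[] values) {
        Map<Long, Double> m = new HashMap<>();
        for (int k = 0; k < items.length; k++) {
            m.put(items[k], values[k]);
        }
        return m;
    }

    /** Correlation over co-rated items only; NaN when it cannot be computed. */
    static double pearson(Map<Long, Double> a, Map<Long, Double> b) {
        List<double[]> pairs = new ArrayList<>();
        for (Map.Entry<Long, Double> e : a.entrySet()) {
            Double y = b.get(e.getKey());
            if (y != null) {
                pairs.add(new double[] {e.getValue(), y});
            }
        }
        if (pairs.isEmpty()) {
            return Double.NaN;
        }
        double meanX = 0.0;
        double meanY = 0.0;
        for (double[] p : pairs) {
            meanX += p[0];
            meanY += p[1];
        }
        meanX /= pairs.size();
        meanY /= pairs.size();
        double sumXY = 0.0;
        double sumXX = 0.0;
        double sumYY = 0.0;
        for (double[] p : pairs) {
            double dx = p[0] - meanX;
            double dy = p[1] - meanY;
            sumXY += dx * dy;
            sumXX += dx * dx;
            sumYY += dy * dy;
        }
        double denominator = Math.sqrt(sumXX * sumYY);
        // With a single shared item every deviation from the mean is zero,
        // so the denominator is zero and the correlation is undefined.
        return denominator == 0.0 ? Double.NaN : sumXY / denominator;
    }

    public static void main(String[] args) {
        Map<Long, Double> a = row(new long[] {101, 102, 103}, new double[] {1, 2, 3});
        Map<Long, Double> b = row(new long[] {101, 102, 103, 104}, new double[] {2, 4, 6, 1});
        Map<Long, Double> c = row(new long[] {101, 200}, new double[] {5, 1});
        System.out.println("a vs b: " + pearson(a, b));
        System.out.println("a vs c (one shared item): " + pearson(a, c));
    }
}
```

Note that the result depends only on how the shared values move together, not on how many items are shared, which is the first problem listed above.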

Page 22: Mahout part1

Similarity metrics

Euclidean distance similarity: 1 / (1 + Euclidean distance)

Cosine measure similarity: a value between –1 and 1

Tanimoto coefficient similarity: the ratio of the size of the intersection to the size of the union of the two users' preferred items
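The three measures can be sketched in plain Java as below. The sketch assumes the Euclidean distance and the cosine are both computed over co-rated items only, and that the Tanimoto coefficient ignores preference values entirely; names and sample values are hypothetical, and Mahout's own classes may differ in detail.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of three similarity measures between two users. */
final class SimilaritySketch {

    /** Convenience builder: item IDs and values at matching positions. */
    static Map<Long, Double> row(long[] items, double[] values) {
        Map<Long, Double> m = new HashMap<>();
        for (int k = 0; k < items.length; k++) {
            m.put(items[k], values[k]);
        }
        return m;
    }

    /** 1 / (1 + d), where d is the Euclidean distance over co-rated items. */
    static double euclidean(Map<Long, Double> a, Map<Long, Double> b) {
        double sumOfSquares = 0.0;
        int overlap = 0;
        for (Map.Entry<Long, Double> e : a.entrySet()) {
            Double y = b.get(e.getKey());
            if (y != null) {
                double diff = e.getValue() - y;
                sumOfSquares += diff * diff;
                overlap++;
            }
        }
        return overlap == 0 ? Double.NaN : 1.0 / (1.0 + Math.sqrt(sumOfSquares));
    }

    /** Cosine of the angle between the two preference vectors over co-rated items. */
    static double cosine(Map<Long, Double> a, Map<Long, Double> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<Long, Double> e : a.entrySet()) {
            Double y = b.get(e.getKey());
            if (y != null) {
                dot += e.getValue() * y;
                normA += e.getValue() * e.getValue();
                normB += y * y;
            }
        }
        double denominator = Math.sqrt(normA) * Math.sqrt(normB);
        return denominator == 0.0 ? Double.NaN : dot / denominator;
    }

    /** Size of the intersection divided by size of the union of the two item sets. */
    static double tanimoto(Set<Long> a, Set<Long> b) {
        Set<Long> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int union = a.size() + b.size() - intersection.size();
        return union == 0 ? Double.NaN : (double) intersection.size() / union;
    }

    public static void main(String[] args) {
        Map<Long, Double> a = row(new long[] {101, 102}, new double[] {1, 2});
        Map<Long, Double> b = row(new long[] {101, 102, 103}, new double[] {4, 6, 5});
        System.out.println("euclidean: " + euclidean(a, b));
        System.out.println("cosine: " + cosine(a, b));
        System.out.println("tanimoto: " + tanimoto(a.keySet(), b.keySet()));
    }
}
```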

Page 23: Mahout part1

Item-based recommendation

The algorithm

for every item i that u has no preference for yet
  for every item j that u has a preference for
    compute a similarity s between i and j
    add u's preference for j, weighted by s, to a running average
return the top items, ranked by weighted average
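A plain-Java sketch of this loop follows. It is an illustration, not Mahout's GenericItemBasedRecommender: the item-item similarity is passed in as a function, pairs with a non-positive or undefined similarity are skipped, and all names and sample values are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.ToDoubleBiFunction;

/** Hypothetical plain-Java sketch of the item-based algorithm. */
final class ItemBasedSketch {

    /** Made-up item-item similarities used by main and by the usage example. */
    static double sampleSimilarity(Long i, Long j) {
        if (i == 103L) {
            return j == 101L ? 0.8 : 0.2;
        }
        return j == 101L ? 0.1 : 0.9;
    }

    /** Returns up to howMany item IDs not yet rated by the user, best estimate first. */
    static List<Long> recommend(Map<Long, Double> userPrefs, Set<Long> allItems,
                                ToDoubleBiFunction<Long, Long> itemSimilarity, int howMany) {
        Map<Long, Double> estimates = new HashMap<>();
        for (long i : allItems) {
            if (userPrefs.containsKey(i)) {
                continue; // the user already has a preference for i
            }
            double weightedSum = 0.0;
            double weightTotal = 0.0;
            for (Map.Entry<Long, Double> rated : userPrefs.entrySet()) {
                double s = itemSimilarity.applyAsDouble(i, rated.getKey());
                if (Double.isNaN(s) || s <= 0.0) {
                    continue; // assumption: ignore undefined or non-positive similarity
                }
                weightedSum += s * rated.getValue();
                weightTotal += s;
            }
            if (weightTotal > 0.0) {
                estimates.put(i, weightedSum / weightTotal);
            }
        }
        List<Long> ranked = new ArrayList<>(estimates.keySet());
        ranked.sort((a, b) -> Double.compare(estimates.get(b), estimates.get(a)));
        return new ArrayList<>(ranked.subList(0, Math.min(howMany, ranked.size())));
    }

    public static void main(String[] args) {
        // Made-up data: the user likes item 101 and dislikes item 102.
        Map<Long, Double> userPrefs = new HashMap<>();
        userPrefs.put(101L, 5.0);
        userPrefs.put(102L, 1.0);
        Set<Long> allItems = new TreeSet<>(Arrays.asList(101L, 102L, 103L, 104L));
        System.out.println(recommend(userPrefs, allItems, ItemBasedSketch::sampleSimilarity, 2));
    }
}
```

Unlike the user-based version, the inner loop runs over the user's own items, so the similarities involved are between items rather than between users.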

Page 24: Mahout part1

GenericItemBasedRecommender

Page 25: Mahout part1

Slope-one recommender

The algorithm

for every item i the user u expresses no preference for
  for every item j that user u expresses a preference for
    find the average preference difference between j and i
    add this diff to u's preference value for j
    add this to a running average
return the top items, ranked by these averages
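The steps above can be sketched in plain Java. This is the simple unweighted form exactly as listed, with the average difference recomputed on demand rather than precomputed; the class name, method names, and sample data are hypothetical, and this is not Mahout's slope-one implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical plain-Java sketch of the slope-one algorithm (unweighted). */
final class SlopeOneSketch {

    /** Made-up sample data: user ID -> (item ID -> preference value). */
    static Map<Long, Map<Long, Double>> sample() {
        Map<Long, Double> u1 = new HashMap<>();
        u1.put(101L, 5.0);
        u1.put(102L, 3.0);
        u1.put(103L, 2.0);
        Map<Long, Double> u2 = new HashMap<>();
        u2.put(101L, 3.0);
        u2.put(102L, 4.0);
        Map<Long, Double> u3 = new HashMap<>();
        u3.put(102L, 2.0);
        u3.put(103L, 5.0);
        Map<Long, Map<Long, Double>> prefs = new HashMap<>();
        prefs.put(1L, u1);
        prefs.put(2L, u2);
        prefs.put(3L, u3);
        return prefs;
    }

    /** Average of (preference for i - preference for j) over users who rated both; NaN if none. */
    static double averageDiff(Map<Long, Map<Long, Double>> prefs, long i, long j) {
        double sum = 0.0;
        int count = 0;
        for (Map<Long, Double> row : prefs.values()) {
            Double pi = row.get(i);
            Double pj = row.get(j);
            if (pi != null && pj != null) {
                sum += pi - pj;
                count++;
            }
        }
        return count == 0 ? Double.NaN : sum / count;
    }

    /** Estimated preference of user u for every item u has not rated yet. */
    static Map<Long, Double> estimates(Map<Long, Map<Long, Double>> prefs, long u) {
        Map<Long, Double> own = prefs.get(u);
        Map<Long, Double> result = new HashMap<>();
        if (own == null) {
            return result;
        }
        Set<Long> allItems = new TreeSet<>();
        for (Map<Long, Double> row : prefs.values()) {
            allItems.addAll(row.keySet());
        }
        for (long i : allItems) {
            if (own.containsKey(i)) {
                continue;
            }
            double sum = 0.0;
            int count = 0;
            for (Map.Entry<Long, Double> rated : own.entrySet()) {
                double diff = averageDiff(prefs, i, rated.getKey());
                if (!Double.isNaN(diff)) {
                    sum += rated.getValue() + diff; // u's preference for j plus the average diff
                    count++;
                }
            }
            if (count > 0) {
                result.put(i, sum / count);
            }
        }
        return result;
    }

    /** Returns up to howMany unrated item IDs for user u, best estimate first. */
    static List<Long> recommend(Map<Long, Map<Long, Double>> prefs, long u, int howMany) {
        Map<Long, Double> est = estimates(prefs, u);
        List<Long> ranked = new ArrayList<>(est.keySet());
        ranked.sort((a, b) -> Double.compare(est.get(b), est.get(a)));
        return new ArrayList<>(ranked.subList(0, Math.min(howMany, ranked.size())));
    }

    public static void main(String[] args) {
        System.out.println("estimates for user 2: " + estimates(sample(), 2L));
        System.out.println("top item for user 3: " + recommend(sample(), 3L, 1));
    }
}
```

No similarity metric is needed here; only the average differences between pairs of items are used.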

Page 26: Mahout part1

Taking Recommender to Production

Page 27: Mahout part1

User-based recommenders

Page 28: Mahout part1

Thank You

Contact at:

Email: [email protected]

Twitter: Twitter.com/yasmine_mohamed