Post on 19-Aug-2014
A QUICK TUTORIAL ON MAHOUT’S RECOMMENDATION ENGINE (V 0.4)
Jee Vang, Ph.D.
vangjee@gmail.com
A Quick Tutorial on Mahout's Recommendation Engine is licensed under a Creative Commons Attribution 3.0 Unported License.
Slide Version 3.1
What is recommendation?
- Recommendation involves predicting which new items a user would like or dislike, based on the user's preferences for, or associations to, previous items
- (Made-up) Example: A user, John Doe, likes the following books (items):
  - A Tale of Two Cities
  - The Great Gatsby
  - For Whom the Bell Tolls
- Recommendations predict which new books (items) John Doe will like:
  - Jane Eyre
  - The Adventures of Tom Sawyer
What is Mahout?
- Mahout is a machine learning application programming interface (API) built on Hadoop
  - MapReduce (MR or M/R)
  - Hadoop Distributed File System (HDFS)
- Mahout is written in Java
- Mahout has machine learning algorithms in the following areas:
  - Clustering
  - Pattern mining
  - Classification
  - Regression
  - Evolutionary algorithms
  - Recommenders/Collaborative filtering
How does Mahout’s Recommendation Engine Work?
S x U = R
- S is the similarity matrix between items
- U is the user's preferences for items
- R is the predicted recommendations
What is the similarity matrix, S?
- S is an n x n (square) matrix
- Each element, e, in S is indexed by row (j) and column (k): e_jk
- Each e_jk in S holds a value that describes how similar the corresponding j-th and k-th items are
- In this example, the similarity of the j-th and k-th items is determined by the frequency of their co-occurrence (when the j-th item is seen, the k-th item is seen as well)
- In general, any similarity measure may be used to produce these values
- We see in this example that Items 1 and 2 co-occur 3 times, Items 1 and 3 co-occur 4 times, and so on…
[Figure: the 7 x 7 similarity matrix S, with rows and columns labeled Item 1 through Item 7; the cell values were not preserved]
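The co-occurrence counting described for S can be sketched in a few lines of Python. This is only an illustration: the user histories are made up (the slide's actual matrix values are not available here), `cooccurrence_matrix` is a hypothetical helper rather than a Mahout API, and the diagonal is simply left at zero.

```python
from itertools import combinations


def cooccurrence_matrix(user_items, n_items):
    """Build an n x n matrix whose entry [j][k] counts users who have both item j and item k."""
    s = [[0] * n_items for _ in range(n_items)]
    for items in user_items.values():
        unique = sorted(set(items))
        for j, k in combinations(unique, 2):
            s[j][k] += 1
            s[k][j] += 1  # co-occurrence is symmetric
    return s


# Made-up histories; item indices are 0-based, so "Item 1" is index 0.
histories = {
    "u1": [0, 1, 2],
    "u2": [0, 1],
    "u3": [0, 2, 3],
}
S = cooccurrence_matrix(histories, n_items=4)
for row in S:
    print(row)
```

Any other similarity measure could fill the same matrix; only the per-pair computation would change.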
What are the user’s preferences, U?
- The user’s preferences are represented as a column vector
- Each value in the vector represents the user’s preference for the j-th item
- In general, this column vector is sparse
- Values of zero, 0, represent no recorded preference for the j-th item
[Figure: the column vector U, with one entry for each of Item 1 through Item 7; the values were not preserved]
What is the recommendation, R?
- R is a column vector; its j-th value is the predicted recommendation score of the j-th item for the user
- R is computed by multiplying S and U: S x U = R
- In this running example, the user has already expressed positive preferences for Items 1, 4, 5 and 7, so we look only at Items 2, 3, and 6
- We would recommend Items 3, 2, and 6 to the user, in this order
[Figure: the column vector R, with one entry for each of Item 1 through Item 7; the values were not preserved]
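The computation S x U = R, followed by dropping the items the user already has preferences for and ranking the rest, might look like the Python sketch below. The 4 x 4 matrix and the preference vector are invented for illustration (they are not the slide's values), and `recommend` is a hypothetical helper, not part of Mahout.

```python
def recommend(S, U, top_n=3):
    """Multiply S by U, then rank the items the user has no recorded preference for."""
    n = len(U)
    R = [sum(S[j][k] * U[k] for k in range(n)) for j in range(n)]
    candidates = [j for j in range(n) if U[j] == 0]
    ranked = sorted(candidates, key=lambda j: R[j], reverse=True)
    return R, ranked[:top_n]


# Made-up similarity matrix and preference vector (0-based item indices).
S = [
    [0, 2, 2, 1],
    [2, 0, 1, 0],
    [2, 1, 0, 1],
    [1, 0, 1, 0],
]
U = [1.0, 0.0, 0.0, 1.0]  # preferences recorded for items 0 and 3 only
R, ranked = recommend(S, U)
print(R)       # score for every item
print(ranked)  # indices of items without a recorded preference, best first
```

With these numbers, item 2 scores higher than item 1, so it is recommended first.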
What data format does Mahout’s recommendation engine expect?
- For Mahout v0.4, look at RecommenderJob (org.apache.mahout.cf.taste.hadoop.item.RecommenderJob)
- Each line of the input file should have the following format: userID,itemID[,preferencevalue]
  - userID is parsed as a long
  - itemID is parsed as a long
  - preferencevalue is parsed as a double and is optional
- Format 1 (no preference values):
    123,345
    123,456
    123,789
    …
    789,458
- Format 2 (with preference values):
    123,345,1.0
    123,456,2.2
    123,789,3.4
    …
    789,458,1.2
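To make the two input formats concrete, here is a small Python sketch of a line parser. It is not Mahout's parser; `parse_preference` is a hypothetical name, and treating a missing preference value as 1.0 is an assumption made for this example.

```python
def parse_preference(line, default_value=1.0):
    """Parse 'userID,itemID[,preferencevalue]' into (int, int, float)."""
    fields = line.strip().split(",")
    if len(fields) not in (2, 3):
        raise ValueError(f"expected 2 or 3 comma-separated fields: {line!r}")
    user_id = int(fields[0])   # a Java long in Mahout; Python ints are unbounded
    item_id = int(fields[1])
    # Assumption: a line without a preference value gets the default.
    value = float(fields[2]) if len(fields) == 3 else default_value
    return user_id, item_id, value


print(parse_preference("123,345"))      # Format 1 line
print(parse_preference("123,456,2.2"))  # Format 2 line
```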
How do you run Mahout’s recommendation engine?
- Requirements
  - Hadoop cluster on GNU/Linux
  - Java 1.6.x
  - SSH
- Assuming you have a Hadoop cluster installed and configured correctly, with the data loaded into HDFS:
    $HADOOP_INSTALL$/bin/hadoop jar $TARGET$/mahout-core-0.4-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -Dmapred.input.dir=$INPUT$ -Dmapred.output.dir=$OUTPUT$
  - $HADOOP_INSTALL$ is the location where you installed Hadoop
  - $TARGET$ is the directory where you have the Mahout jar file
  - $INPUT$ is the input path in HDFS
  - $OUTPUT$ is the output path in HDFS
- There are plenty of runtime options (check the javadocs)
  - --userFile (path): optional; a file containing userIDs; only preferences for these userIDs will be computed
  - --itemsFile (path): optional; a file containing itemIDs; only these items will be used in the recommendation predictions
  - --numRecommendations (integer): number of recommendations to compute per user; default 10
  - --booleanData (boolean): treat input data as having no preference values; default false
  - --maxPrefsPerUser (integer): maximum number of preferences considered per user in the final recommendation phase; default 10
  - --similarityClassname (classname): similarity measure (co-occurrence, Euclidean, log-likelihood, Pearson, Tanimoto coefficient, uncentered cosine, cosine)
What are the mechanics of Mahout’s recommendation engine?
- Mahout is built on Hadoop’s MapReduce (MR) API
  - map: <K1,V1> -> <K2,V2>
  - reduce: <K2,List(V2)> -> <K3,V3>
- A series of MR phases (jobs) is called to accomplish the task of predicting recommendations
  - ItemIDIndexMapper, ItemIDIndexReducer
  - ToItemPrefsMapper, ToUserVectorReducer
  - CountUsersMapper, CountUsersReducer
  - …
  - PartialMultiplyMapper, AggregateAndRecommendReducer
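The map and reduce signatures above can be mimicked with a tiny in-memory driver. The Python sketch below is only a teaching aid, not Hadoop: `run_map_reduce`, `item_mapper`, and `count_reducer` are hypothetical names, and the shuffle is a plain dictionary grouping.

```python
from collections import defaultdict


def run_map_reduce(records, mapper, reducer):
    """Run one in-memory MapReduce pass over (key, value) records."""
    grouped = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):        # <K1,V1> -> <K2,V2>
            grouped[k2].append(v2)           # shuffle: group values by K2
    output = []
    for k2 in sorted(grouped):
        output.extend(reducer(k2, grouped[k2]))  # <K2,List(V2)> -> <K3,V3>
    return output


def item_mapper(offset, line):
    """Emit <itemID, 1> for each input line of the form userID,itemID[,value]."""
    item_id = int(line.split(",")[1])
    yield item_id, 1


def count_reducer(item_id, ones):
    """Emit <itemID, number of lines that mention it>."""
    yield item_id, sum(ones)


lines = ["123,345", "123,456", "789,345"]
records = list(enumerate(lines))  # the key plays the role of the file offset
print(run_map_reduce(records, item_mapper, count_reducer))
```

Each phase of the recommender job follows this shape, with the output of one pass feeding a later one.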
Mahout’s Recommender Engine: Phase 1, Generate List of ItemIDs
Mapper: ItemIDIndexMapper
- Input: <LongWritable,Text>
- Output: <VarIntWritable,VarLongWritable>
- Parses out the itemID as a long, itemID_long
- Converts itemID_long to an int, itemID_int
- Emits <itemID_int, itemID_long>
Reducer: ItemIDIndexReducer
- Input: <VarIntWritable,List(VarLongWritable)>
- Output: <VarIntWritable,VarLongWritable>
- Finds the smallest value in the list of values, itemID_longmin
- Emits <itemID_int, itemID_longmin>
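A rough Python sketch of Phase 1 follows. The slide does not say how the long itemID is converted to an int; the folding used in `id_to_index` (XOR of the high and low 32 bits, masked to 31 bits) is an assumption for illustration, and all function names are hypothetical. The example shows why the reducer needs a rule such as "keep the smallest": two different long IDs can land on the same int index.

```python
def id_to_index(item_id):
    """Fold a 64-bit ID into a non-negative 31-bit int (assumed hashing scheme)."""
    item_id &= 0xFFFFFFFFFFFFFFFF                 # treat as unsigned 64-bit
    folded = (item_id ^ (item_id >> 32)) & 0xFFFFFFFF
    return folded & 0x7FFFFFFF


def item_id_index_map(line):
    """Mapper step: emit <itemID_int, itemID_long> for one input line."""
    item_id = int(line.split(",")[1])
    return id_to_index(item_id), item_id


def item_id_index_reduce(index, item_ids):
    """Reducer step: keep the smallest long ID seen for this int index."""
    return index, min(item_ids)


# 4294967640 is chosen so that it collides with 345 under the assumed folding.
lines = ["123,345", "789,345", "123,4294967640"]
grouped = {}
for line in lines:
    index, item_id = item_id_index_map(line)
    grouped.setdefault(index, []).append(item_id)
index_to_id = dict(item_id_index_reduce(i, ids) for i, ids in grouped.items())
print(index_to_id)
```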
Mahout’s Recommender Engine: Phase 2, Create Preference Vector
Mapper: ToItemPrefsMapper
- Input: <LongWritable,Text>
- Output: <VarLongWritable,VarLongWritable>
- Parses out userID and itemID
- Emits <userID,itemID>
Reducer: ToUserVectorReducer
- Input: <VarLongWritable,List(VarLongWritable)>
- Output: <VarLongWritable,VectorWritable>
- Creates preferences, U
- U is a sparse Vector
- Emits <userID, U>
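Phase 2 can be pictured with the Python sketch below, where a dictionary stands in for the sparse vector U. The function names are hypothetical, the raw itemID is used as the vector index for simplicity, and a missing preference value is assumed to default to 1.0.

```python
from collections import defaultdict


def to_item_prefs_map(line):
    """Mapper step: emit <userID, (itemID, value)> for one input line."""
    fields = line.strip().split(",")
    user_id, item_id = int(fields[0]), int(fields[1])
    value = float(fields[2]) if len(fields) > 2 else 1.0  # assumed default
    return user_id, (item_id, value)


def to_user_vector_reduce(user_id, item_prefs):
    """Reducer step: collect one user's preferences into a sparse vector."""
    return user_id, dict(item_prefs)  # only items with a preference are stored


lines = ["123,345,1.0", "123,456,2.2", "789,458,1.2"]
grouped = defaultdict(list)
for line in lines:
    user_id, pref = to_item_prefs_map(line)
    grouped[user_id].append(pref)
user_vectors = dict(to_user_vector_reduce(u, p) for u, p in grouped.items())
print(user_vectors)
```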
Mahout’s Recommender Engine: Phase 3, Count Unique Users
Mapper: CountUsersMapper
- Input: <LongWritable,Text>
- Output: <CountUsersKeyWritable,VarLongWritable>
- Parses out userID
- Emits <userID,userID>
Reducer: CountUsersReducer
- Input: <CountUsersKeyWritable,List(VarLongWritable)>
- Output: <VarIntWritable,NullWritable>
- Counts all unique users, numUsers
- Emits <numUsers, null>
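One way to count unique users in a reducer is to rely on the keys arriving in sorted order and to count the points where the userID changes. The Python sketch below assumes exactly that (a single reducer receiving a sorted stream); the function names are hypothetical.

```python
def count_users_map(line):
    """Mapper step: emit <userID, userID> for one input line."""
    user_id = int(line.split(",")[0])
    return user_id, user_id


def count_users_reduce(sorted_user_ids):
    """Reducer step: count distinct IDs in a sorted stream by watching for changes."""
    count = 0
    previous = None
    for user_id in sorted_user_ids:
        if user_id != previous:
            count += 1
            previous = user_id
    return count


lines = ["123,345", "789,458", "123,456", "123,789"]
emitted = sorted(value for _, value in map(count_users_map, lines))
num_users = count_users_reduce(emitted)
print(num_users)
```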
Mahout’s Recommender Engine: Phase 4, Transpose Preferences Vectors
Mapper: MaybePruneRowsMapper
- Input: <VarLongWritable,VectorWritable>
- Uses MR output from Phase 2
- Output: <IntWritable,DistributedRowMatrix.MatrixEntryWritable>
- Transposes MR output from Phase 2
  - MR Phase 2 output had users as rows and items as columns
  - Now, items are rows and users are columns
- Each element, e_jk, is transposed to e_kj
- Emits <k, e_kj>
Reducer: ToItemVectorsReducer
- Input: <IntWritable,List(DistributedRowMatrix.MatrixEntryWritable)>
- Output: <IntWritable,VectorWritable>
- Writes transposed user preference vectors, U’
- Emits <row, U’>
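The transpose in Phase 4 turns "one sparse row per user" into "one sparse row per item". A minimal Python sketch, using nested dictionaries as sparse vectors (`transpose` is a hypothetical helper and the data is made up):

```python
from collections import defaultdict


def transpose(user_vectors):
    """Turn {user: {item: value}} into {item: {user: value}}."""
    item_vectors = defaultdict(dict)
    for user_id, prefs in user_vectors.items():
        for item_id, value in prefs.items():
            # Map side: emit <item, (user, value)>; reduce side: gather per item.
            item_vectors[item_id][user_id] = value
    return dict(item_vectors)


user_vectors = {123: {345: 1.0, 456: 2.2}, 789: {345: 3.0}}
item_vectors = transpose(user_vectors)
print(item_vectors)
```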
Mahout’s Recommender Engine: Phase 5.1, RowSimilarityJob, Compute Weights
Mapper: RowWeightMapper
- Input: <IntWritable,VectorWritable>
- Uses MR output from Phase 4
- Output: <VarIntWritable,WeightedOccurrences>
- For each element, e_jk, computes its weighted occurrence, w_jk
- Emits <k, w_jk>
Reducer: WeightedOccurrencesPerColumnReducer
- Input: <VarIntWritable,List(WeightedOccurrences)>
- Output: <VarIntWritable,WeightedOccurrenceArray>
- Transfers weighted occurrences to an array and writes the results
- Emits <k, w_jk>
Mahout’s Recommender Engine: Phase 5.2, RowSimilarityJob, Compute Similarities
Mapper: CooccurrencesMapper
- Input: <VarIntWritable,WeightedOccurrenceArray>
- Uses MR output from Phase 5.1
- Output: <WeightedRowPair,Cooccurrence>
- For each pair of rows, p, writes its column co-occurrences, c
- Emits <p, c>
Reducer: SimilarityReducer
- Input: <WeightedRowPair,List(Cooccurrence)>
- Output: <SimilarityMatrixEntryKey,MatrixEntryWritable>
- Computes the row similarity between row_a and row_b, and writes it to the corresponding position in the matrix
- Emits <row_j, matrix entry>
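The split between "compute a weight per row" and "combine co-occurrences per row pair" can be illustrated with cosine similarity, where the row weight is the Euclidean norm. The Python sketch below is an assumption-laden illustration of that structure, not Mahout's implementation; all names are hypothetical and the data is made up.

```python
import math
from collections import defaultdict
from itertools import combinations


def row_weights(item_vectors):
    """One weight per row (item); here the Euclidean norm of the row."""
    return {item: math.sqrt(sum(v * v for v in users.values()))
            for item, users in item_vectors.items()}


def pairwise_products(item_vectors):
    """Per user column, emit <(row_a, row_b), product> for every pair of rows."""
    by_user = defaultdict(dict)
    for item, users in item_vectors.items():
        for user, value in users.items():
            by_user[user][item] = value
    for prefs in by_user.values():
        for a, b in combinations(sorted(prefs), 2):
            yield (a, b), prefs[a] * prefs[b]


def cosine_similarities(item_vectors):
    """Sum the products per row pair, then normalize by the two row weights."""
    weights = row_weights(item_vectors)
    dots = defaultdict(float)
    for pair, product in pairwise_products(item_vectors):
        dots[pair] += product
    return {(a, b): dot / (weights[a] * weights[b]) for (a, b), dot in dots.items()}


item_vectors = {
    345: {123: 1.0, 789: 1.0},
    456: {123: 1.0},
    458: {789: 1.0},
}
sims = cosine_similarities(item_vectors)
print(sims)
```

Pairs of items that never share a user produce no entry at all, which keeps the similarity matrix sparse.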
Mahout’s Recommender Engine: Phase 5.3, RowSimilarityJob, Similarity Matrix
Mapper: Mapper
- Input: <SimilarityMatrixEntryKey,MatrixEntryWritable>
- Uses MR output from Phase 5.2
- Output: <SimilarityMatrixEntryKey,MatrixEntryWritable>
- Writes the similarity matrix entry key, sme, and matrix entry, me, as is
  - sme is basically each row
  - me is basically each row-col entry of the similarity matrix
- Emits <sme,me>
Reducer: EntriesToVectorsReducer
- Input: <SimilarityMatrixEntryKey,List(MatrixEntryWritable)>
- Output: <IntWritable,VectorWritable>
- Writes the row and its associated vector out
- Emits <row, vector>
Mahout’s Recommender Engine: Phase 6, Pre-partial multiply, Similarity Matrix
Mapper: SimilarityMatrixRowWrapperMapper
- Input: <IntWritable,VectorWritable>
- Uses MR output from Phase 5.3
- Output: <IntWritable,VectorOrPrefWritable>
- Wraps the similarity vector, v1, into a different vector format, v2
- Emits <row,v2>
Reducer: Reducer
- Input: <IntWritable,List(VectorOrPrefWritable)>
- Output: <IntWritable,VectorOrPrefWritable>
- Writes the row and each of its associated vectors out
- Emits <row, vector>
Mahout’s Recommender Engine: Phase 7, Pre-partial multiply, Preferences
Mapper: UserVectorSplitterMapper
- Input: <VarLongWritable,VectorWritable>
- Uses MR output from Phase 2
- Output: <VarIntWritable,VectorOrPrefWritable>
- Maps userID and preference vector, U
- Emits <userID,U>
Reducer: Reducer
- Input: <IntWritable,List(VectorOrPrefWritable)>
- Output: <IntWritable,VectorOrPrefWritable>
- Writes the row and each of its associated vectors out
- Emits <row, vector>
Mahout’s Recommender Engine: Phase 8, Partial Multiply
Mapper: Mapper
- Input: <VarLongWritable,VectorWritable>
- Uses MR outputs from Phases 6 and 7
- Output: <VarIntWritable,VectorOrPrefWritable>
- Maps row and vector, v
- Emits <row,v>
Reducer: ToVectorAndPrefReducer
- Input: <VarIntWritable,List(VectorOrPrefWritable)>
- Output: <IntWritable,VectorOrPrefWritable>
- Writes the row together with its similarity vector, userIDs, and preference values
- Emits <row, vector>
Mahout’s Recommender Engine: Phase 9, Filters Items
Mapper: ItemFilterMapper
- Input: <LongWritable,Text>
- Output: <VarLongWritable,VarLongWritable>
- Parses userID and itemID
- Emits <itemID,userID>
Reducer: ItemFilterAsVectorAndPrefReducer
- Input: <VarLongWritable,List(VarLongWritable)>
- Output: <VarIntWritable,VectorOrPrefWritable>
- Writes itemID and a vector of userIDs and preferences
- Emits <itemID, vector>
Mahout’s Recommender Engine: Phase 10, Aggregate and Recommend
Mapper: PartialMultiplyMapper
- Input: <VarIntWritable,VectorAndPrefsWritable>
- Uses MR outputs from Phases 8 and 9
- Output: <VarLongWritable,PrefAndSimilarityColumnWritable>
- Writes userID and recommendations
- Emits <userID,recommendation>
Reducer: AggregateAndRecommendReducer
- Input: <VarLongWritable,List(PrefAndSimilarityColumnWritable)>
- Output: <VarLongWritable,RecommendedItemsWritable>
- Writes userID and a vector of recommendations
- Emits <userID, vector>
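The last phase can be read as S x U = R computed one item at a time: each item's similarity column is scaled by a user's preference for that item, and the scaled columns are summed per user. The Python sketch below illustrates that idea with made-up data. It uses a plain sum (matching S x U = R from the earlier slide) and hypothetical function names; whether Mahout additionally normalizes the scores is not stated in the slides.

```python
from collections import defaultdict


def partial_multiply_map(similarity_column, user_prefs):
    """For one item: pair its similarity column with each user's preference for it."""
    for user_id, pref in user_prefs:
        yield user_id, (pref, similarity_column)


def aggregate_and_recommend_reduce(user_id, partials, rated_items, top_n=10):
    """Sum pref * column over a user's items, then rank the items not yet rated."""
    scores = defaultdict(float)
    for pref, column in partials:
        for item_id, similarity in column.items():
            scores[item_id] += pref * similarity
    candidates = [(item, score) for item, score in scores.items()
                  if item not in rated_items]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return user_id, candidates[:top_n]


# Made-up similarity columns (item -> {other item: similarity}) and preferences.
columns = {1: {2: 3.0, 3: 4.0, 4: 1.0}, 4: {1: 1.0, 2: 1.0, 3: 2.0}}
prefs_by_item = {1: [(123, 1.0)], 4: [(123, 2.0)]}

partials_by_user = defaultdict(list)
for item_id, column in columns.items():
    for user_id, partial in partial_multiply_map(column, prefs_by_item[item_id]):
        partials_by_user[user_id].append(partial)

result = aggregate_and_recommend_reduce(123, partials_by_user[123], rated_items={1, 4})
print(result)
```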
Summary and Conclusion
- Mahout is a machine learning API built on top of Hadoop which includes clustering, pattern mining, classification, regression, evolutionary algorithms, and recommenders
- Mahout’s recommender engine transforms an expected input format into predicted recommendations
  - Uses a series of MR phases to accomplish predicting recommendations