Recommendation and graph algorithms in Hadoop and SQL
![Page 1: Recommendation and graph algorithms in Hadoop and SQL](https://reader033.fdocuments.us/reader033/viewer/2022052821/554a06f4b4c905557a8b55cd/html5/thumbnails/1.jpg)
Recommendation and graph algorithms in Hadoop and SQL
DAVID F. GLEICH, ASSISTANT PROFESSOR, COMPUTER SCIENCE, PURDUE UNIVERSITY
David Gleich · Purdue 1
Code github.com/dgleich/matrix-hadoop-tutorial
Ancestry.com
@dgleich
![Page 2: Recommendation and graph algorithms in Hadoop and SQL](https://reader033.fdocuments.us/reader033/viewer/2022052821/554a06f4b4c905557a8b55cd/html5/thumbnails/2.jpg)
Matrix computations
$$A = \begin{bmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\ A_{2,1} & A_{2,2} & \cdots & \vdots \\ \vdots & \ddots & \ddots & A_{m-1,n} \\ A_{m,1} & \cdots & A_{m,n-1} & A_{m,n} \end{bmatrix}$$
Operations: $Ax$
Linear systems: $Ax = b$
Least squares: $\min \|Ax - b\|$
Eigenvalues: $Ax = \lambda x$
![Page 3: Recommendation and graph algorithms in Hadoop and SQL](https://reader033.fdocuments.us/reader033/viewer/2022052821/554a06f4b4c905557a8b55cd/html5/thumbnails/3.jpg)
Outcomes
Recognize relationships between matrix methods and things you’ve already been doing. Example: SQL queries as matrix computations.
See how to work with big graphs as large edge lists in Hadoop and SQL. Example: connected components.
Understand how to use Hadoop to compute these matrix methods at scale for BigData. Example: recommenders with social network info.
![Page 4: Recommendation and graph algorithms in Hadoop and SQL](https://reader033.fdocuments.us/reader033/viewer/2022052821/554a06f4b4c905557a8b55cd/html5/thumbnails/4.jpg)
matrix computations ≠ linear algebra
![Page 5: Recommendation and graph algorithms in Hadoop and SQL](https://reader033.fdocuments.us/reader033/viewer/2022052821/554a06f4b4c905557a8b55cd/html5/thumbnails/5.jpg)
World’s simplest recommendation system.
Suggest the average rating.
![Page 6: Recommendation and graph algorithms in Hadoop and SQL](https://reader033.fdocuments.us/reader033/viewer/2022052821/554a06f4b4c905557a8b55cd/html5/thumbnails/6.jpg)
A SQL statement as a matrix computation
http://stackoverflow.com/questions/4217449/returning-average-rating-from-a-database-sql
How do I find the average rating for each product?
![Page 7: Recommendation and graph algorithms in Hadoop and SQL](https://reader033.fdocuments.us/reader033/viewer/2022052821/554a06f4b4c905557a8b55cd/html5/thumbnails/7.jpg)
A SQL statement as a matrix computation
http://stackoverflow.com/questions/4217449/returning-average-rating-from-a-database-sql
How do I find the average rating for each product?
SELECT
  p.product_id,
  p.name,
  AVG(pr.rating) AS rating_average
FROM products p
INNER JOIN product_ratings pr
  ON pr.product_id = p.product_id
GROUP BY p.product_id
ORDER BY rating_average DESC
![Page 8: Recommendation and graph algorithms in Hadoop and SQL](https://reader033.fdocuments.us/reader033/viewer/2022052821/554a06f4b4c905557a8b55cd/html5/thumbnails/8.jpg)
This SQL statement is a matrix computation!
Image from rockysprings, deviantart, CC share-alike
![Page 9: Recommendation and graph algorithms in Hadoop and SQL](https://reader033.fdocuments.us/reader033/viewer/2022052821/554a06f4b4c905557a8b55cd/html5/thumbnails/9.jpg)
SELECT ... AVG(pr.rating) ... GROUP BY p.product_id
product_ratings
pid8 uid2 4
pid9 uid9 1
pid2 uid9 5
pid9 uid5 5
pid6 uid8 4
pid1 uid2 4
pid3 uid4 4
pid5 uid9 2
pid9 uid8 4
pid9 uid9 1
Is a matrix!
pid1 pid2 pid3 pid4 pid5 pid6 pid7 pid8 pid9
![Page 10: Recommendation and graph algorithms in Hadoop and SQL](https://reader033.fdocuments.us/reader033/viewer/2022052821/554a06f4b4c905557a8b55cd/html5/thumbnails/10.jpg)
product_ratings
Is a matrix!
But it’s a weird matrix
Missing entries!
![Page 11: Recommendation and graph algorithms in Hadoop and SQL](https://reader033.fdocuments.us/reader033/viewer/2022052821/554a06f4b4c905557a8b55cd/html5/thumbnails/11.jpg)
product_ratings
Is a matrix!
But it’s a weird matrix
Matrix
SELECT AVG(r) ... GROUP BY pid
Vector
Average of ratings
![Page 12: Recommendation and graph algorithms in Hadoop and SQL](https://reader033.fdocuments.us/reader033/viewer/2022052821/554a06f4b4c905557a8b55cd/html5/thumbnails/12.jpg)
But it’s a weird matrix and not a linear operator
$$A = \begin{bmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\ A_{2,1} & A_{2,2} & \cdots & \vdots \\ \vdots & \ddots & \ddots & A_{m-1,n} \\ A_{m,1} & \cdots & A_{m,n-1} & A_{m,n} \end{bmatrix}$$
$$\operatorname{avg}(A) = \begin{bmatrix} \sum_j A_{1,j} \,/\, \sum_j \text{“}A_{1,j} \neq 0\text{”} \\ \sum_j A_{2,j} \,/\, \sum_j \text{“}A_{2,j} \neq 0\text{”} \\ \vdots \\ \sum_j A_{m,j} \,/\, \sum_j \text{“}A_{m,j} \neq 0\text{”} \end{bmatrix}$$
product_ratings
Is a matrix!
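To make the avg(A) formula concrete, here is a minimal Python sketch. It assumes the ratings arrive as (pid, uid, rating) triples like the product_ratings rows, and that a missing entry is simply absent rather than stored as zero. The function name `row_averages` is hypothetical.

```python
from collections import defaultdict

def row_averages(triples):
    """Average the stored entries of each row, skipping missing ones.

    Mirrors SELECT AVG(rating) ... GROUP BY pid on (pid, uid, rating) rows.
    """
    sums = defaultdict(float)   # numerator: sum_j A[i, j]
    counts = defaultdict(int)   # denominator: number of stored entries in row i
    for pid, _uid, rating in triples:
        sums[pid] += rating
        counts[pid] += 1
    return {pid: sums[pid] / counts[pid] for pid in sums}

# Sample rows taken from the product_ratings table on the slides.
ratings = [
    ("pid8", "uid2", 4), ("pid9", "uid9", 1), ("pid2", "uid9", 5),
    ("pid9", "uid5", 5), ("pid6", "uid8", 4), ("pid1", "uid2", 4),
    ("pid3", "uid4", 4), ("pid5", "uid9", 2), ("pid9", "uid8", 4),
]
averages = row_averages(ratings)
```

Products with no ratings (pid4 and pid7 here) get no row in the result at all. Scaling a rating changes the sum but not the count, and a product with no ratings has no average, which is one way to see why this is not a plain linear operator.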
![Page 13: Recommendation and graph algorithms in Hadoop and SQL](https://reader033.fdocuments.us/reader033/viewer/2022052821/554a06f4b4c905557a8b55cd/html5/thumbnails/13.jpg)
matrix computations ≠ linear algebra
![Page 14: Recommendation and graph algorithms in Hadoop and SQL](https://reader033.fdocuments.us/reader033/viewer/2022052821/554a06f4b4c905557a8b55cd/html5/thumbnails/14.jpg)
Hadoop, MapReduce, and Matrix Methods
![Page 15: Recommendation and graph algorithms in Hadoop and SQL](https://reader033.fdocuments.us/reader033/viewer/2022052821/554a06f4b4c905557a8b55cd/html5/thumbnails/15.jpg)
MapReduce
[Figure: input data is split across Map tasks, each emitting (key, value) pairs; Shuffle groups the values by key; each Reduce task receives a key with its values and writes output data.]
![Page 16: Recommendation and graph algorithms in Hadoop and SQL](https://reader033.fdocuments.us/reader033/viewer/2022052821/554a06f4b4c905557a8b55cd/html5/thumbnails/16.jpg)
The MapReduce Framework
Originated at Google for indexing web pages and computing PageRank.
Express algorithms in “data-local operations”.
Implement one type of communication: shuffle.
Shuffle moves all data with the same key to the same reducer.
Data scalable
Fault-tolerance by design
Input stored in triplicate
Map output persisted to disk before shuffle
Reduce input/output on disk
[Figure: input blocks 1 to 5 are each processed by a map task (M); the shuffle routes their output to reduce tasks (R).]
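The map, shuffle, reduce flow can be sketched in a few lines of plain Python. This is an in-memory toy under stated assumptions, not Hadoop: the names `run_mapreduce`, `max_mapper`, and `max_reducer` are hypothetical, and nothing here is distributed or fault tolerant.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Run one map -> shuffle -> reduce round in memory."""
    # Map: each record may emit any number of (key, value) pairs.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            # Shuffle: all values with the same key end up together.
            groups[key].append(value)
    # Reduce: each key is processed with its full list of values.
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output

def max_mapper(record):
    key, value = record
    yield key, value

def max_reducer(key, values):
    yield key, max(values)

result = run_mapreduce([("a", 3), ("b", 1), ("a", 7)], max_mapper, max_reducer)
```

The only communication is the grouping step, which is the point of the slide: an algorithm fits MapReduce when everything else can be done on one record or one key at a time.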
![Page 17: Recommendation and graph algorithms in Hadoop and SQL](https://reader033.fdocuments.us/reader033/viewer/2022052821/554a06f4b4c905557a8b55cd/html5/thumbnails/17.jpg)
wordcount is a matrix computation too
map(document) :
    for word in document
        emit (word, 1)
reduce(word, counts) :
    emit (word, sum(counts))
[Figure: documents 1 to 5 (D) are mapped to pairs such as (matrix, 1), (bigdata, 1), and (hadoop, 1), which are then grouped by word.]
![Page 18: Recommendation and graph algorithms in Hadoop and SQL](https://reader033.fdocuments.us/reader033/viewer/2022052821/554a06f4b4c905557a8b55cd/html5/thumbnails/18.jpg)
wordcount is a matrix computation too
$$A = \begin{bmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\ A_{2,1} & A_{2,2} & \cdots & \vdots \\ \vdots & \ddots & \ddots & A_{m-1,n} \\ A_{m,1} & \cdots & A_{m,n-1} & A_{m,n} \end{bmatrix}$$
Rows of $A$: doc1, doc2, …, docm
word count = colsum(A) = $A^T e$, where $e$ is the vector of all ones
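A small worked instance of word count as $A^T e$, in plain Python. The three documents are made up for illustration, and the dense list-of-lists matrix is only reasonable at toy scale.

```python
docs = ["matrix hadoop bigdata", "bigdata hadoop hadoop", "matrix bigdata"]

# Build the vocabulary (columns) and the document-term matrix A (rows = docs).
vocab = sorted({word for doc in docs for word in doc.split()})
A = [[doc.split().count(word) for word in vocab] for doc in docs]

# e is the vector of all ones, one entry per document.
e = [1] * len(docs)

# colsum(A) = A^T e: entry j sums column j of A over all documents.
word_count = [sum(A[i][j] * e[i] for i in range(len(docs)))
              for j in range(len(vocab))]
counts = dict(zip(vocab, word_count))
```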
![Page 19: Recommendation and graph algorithms in Hadoop and SQL](https://reader033.fdocuments.us/reader033/viewer/2022052821/554a06f4b4c905557a8b55cd/html5/thumbnails/19.jpg)
inverted index is a matrix computation too
$$A = \begin{bmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\ A_{2,1} & A_{2,2} & \cdots & \vdots \\ \vdots & \ddots & \ddots & A_{m-1,n} \\ A_{m,1} & \cdots & A_{m,n-1} & A_{m,n} \end{bmatrix}$$
Rows of $A$: doc1, doc2, …, docm
![Page 20: Recommendation and graph algorithms in Hadoop and SQL](https://reader033.fdocuments.us/reader033/viewer/2022052821/554a06f4b4c905557a8b55cd/html5/thumbnails/20.jpg)
inverted index is a matrix computation too
$$A^T = \begin{bmatrix} A_{1,1} & A_{2,1} & \cdots & A_{m,1} \\ A_{1,2} & A_{2,2} & \cdots & \vdots \\ \vdots & \ddots & \ddots & A_{m,n-1} \\ A_{1,n} & \cdots & A_{m-1,n} & A_{m,n} \end{bmatrix}$$
Rows of $A^T$: term1, term2, …, termn
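A minimal sketch of the inverted index as a transpose, assuming the document-term matrix is stored sparsely as a dict of dicts. The documents, counts, and the helper name `transpose` are made up for illustration.

```python
# Sparse document-term matrix A: one row per document, stored as
# {document: {term: count}}. The documents here are made up.
A = {
    "doc1": {"matrix": 2, "hadoop": 1},
    "doc2": {"hadoop": 3},
    "doc3": {"matrix": 1, "sql": 1},
}

def transpose(rows):
    """Return the transpose of a dict-of-dicts sparse matrix."""
    cols = {}
    for row_id, entries in rows.items():
        for col_id, value in entries.items():
            cols.setdefault(col_id, {})[row_id] = value
    return cols

inverted_index = transpose(A)  # one row per term: {term: {document: count}}
```

Looking up `inverted_index["hadoop"]` then returns the documents containing that term, which is exactly a row of $A^T$.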
![Page 21: Recommendation and graph algorithms in Hadoop and SQL](https://reader033.fdocuments.us/reader033/viewer/2022052821/554a06f4b4c905557a8b55cd/html5/thumbnails/21.jpg)
A recommender system with social info
product_ratings
pid8 uid2 4
pid9 uid9 1
pid2 uid9 5
pid9 uid5 5
pid6 uid8 4
pid1 uid2 4
pid3 uid4 4
pid5 uid9 2
pid9 uid8 4
pid9 uid9 1
friends_links
uid6 uid1
uid8 uid9
uid7 uid7
uid7 uid4
uid6 uid2
uid7 uid1
uid3 uid1
uid1 uid8
uid7 uid3
uid9 uid1
![Page 22: Recommendation and graph algorithms in Hadoop and SQL](https://reader033.fdocuments.us/reader033/viewer/2022052821/554a06f4b4c905557a8b55cd/html5/thumbnails/22.jpg)
A recommender system with social info
product_ratings is a matrix with rows pid1, pid2, …:
$$\begin{bmatrix} A_{1,1} & A_{2,1} & \cdots \\ A_{1,2} & A_{2,2} & \cdots \\ \vdots & \ddots & \ddots \end{bmatrix}$$
friends_links is a matrix with rows uid1, uid2, …:
$$\begin{bmatrix} A_{1,1} & A_{2,1} & \cdots \\ A_{1,2} & A_{2,2} & \cdots \\ \vdots & \ddots & \ddots \end{bmatrix}$$
![Page 23: Recommendation and graph algorithms in Hadoop and SQL](https://reader033.fdocuments.us/reader033/viewer/2022052821/554a06f4b4c905557a8b55cd/html5/thumbnails/23.jpg)
A recommender system with social info
product_ratings is the matrix $R$
friends_links is the matrix $S$
![Page 24: Recommendation and graph algorithms in Hadoop and SQL](https://reader033.fdocuments.us/reader033/viewer/2022052821/554a06f4b4c905557a8b55cd/html5/thumbnails/24.jpg)
A recommender system with social info
Recommend each item based on the average rating of all trusted users
“$X = S R^T$” with something that is almost a matrix-matrix product
$R$ has rows pid1, pid2, … and $S$ has rows uid1, uid2, …; both have the form
$$\begin{bmatrix} A_{1,1} & A_{2,1} & \cdots \\ A_{1,2} & A_{2,2} & \cdots \\ \vdots & \ddots & \ddots \end{bmatrix}$$
$$X_{\text{uid},\text{pid}} = \left( \sum_{\text{uid2}} S_{\text{uid},\text{uid2}} R_{\text{uid2},\text{pid}} \right) \cdot \left( \sum_{\text{uid2}} \text{“}S_{\text{uid},\text{uid2}} \text{ and } R_{\text{uid2},\text{pid}} \neq 0\text{”} \right)^{-1}$$
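The formula for X can be sketched directly in Python. This is a toy under stated assumptions: trust values are 0 or 1 (a set of trusted users stands in for a row of S), ratings are indexed as `ratings[uid2][pid]` to match the subscripts above, and the data and the name `social_average` are hypothetical.

```python
from collections import defaultdict

def social_average(trust, ratings):
    """X[uid][pid] = average rating of pid over the users that uid trusts."""
    X = defaultdict(dict)
    for uid, trusted in trust.items():
        sums = defaultdict(float)
        counts = defaultdict(int)
        for uid2 in trusted:                      # S[uid, uid2] = 1
            for pid, rating in ratings.get(uid2, {}).items():
                sums[pid] += rating               # S[uid, uid2] * R[uid2, pid]
                counts[pid] += 1                  # "S and R both nonzero"
        for pid in sums:
            X[uid][pid] = sums[pid] / counts[pid]
    return dict(X)

# Hypothetical toy data: trust[uid] is the set of users uid trusts,
# ratings[uid2][pid] is the rating uid2 gave to pid.
trust = {"uid1": {"uid8"}, "uid7": {"uid4", "uid9"}}
ratings = {"uid8": {"pid6": 4, "pid9": 4},
           "uid4": {"pid3": 4, "pid9": 5},
           "uid9": {"pid9": 1, "pid2": 5, "pid5": 2}}
X = social_average(trust, ratings)
```

The numerator is an ordinary matrix-matrix product; the per-entry count in the denominator is what makes it only "almost" one.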
![Page 25: Recommendation and graph algorithms in Hadoop and SQL](https://reader033.fdocuments.us/reader033/viewer/2022052821/554a06f4b4c905557a8b55cd/html5/thumbnails/25.jpg)
Tools I like
hadoop streaming, dumbo, mrjob, hadoopy, C++
![Page 26: Recommendation and graph algorithms in Hadoop and SQL](https://reader033.fdocuments.us/reader033/viewer/2022052821/554a06f4b4c905557a8b55cd/html5/thumbnails/26.jpg)
Tools I don’t use but other people seem to like …
pig, java, hbase, mahout, Eclipse, Cassandra
Mahout is the closest thing to a library for matrix computations in Hadoop. If you like Java, you should probably start there. I’m a low-level guy.
![Page 27: Recommendation and graph algorithms in Hadoop and SQL](https://reader033.fdocuments.us/reader033/viewer/2022052821/554a06f4b4c905557a8b55cd/html5/thumbnails/27.jpg)
hadoop streaming
the map function is a program
(key,value) pairs are sent via stdin
output (key,value) pairs go to stdout
the reduce function is a program
(key,value) pairs are sent via stdin
keys are grouped
output (key,value) pairs go to stdout
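A sketch of what such programs can look like in Python, using word count as the task. It assumes the common streaming convention of one tab-separated key and value per line and sorted reducer input. The `map` / `reduce` command-line switch is a hypothetical convention for this sketch, not something Hadoop requires.

```python
import sys
from itertools import groupby

def map_lines(lines):
    """Mapper: read raw text lines, write tab-separated (word, 1) pairs."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word.lower()

def reduce_lines(lines):
    """Reducer: input arrives sorted by key, so equal keys are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))

if __name__ == "__main__" and sys.argv[1:] in (["map"], ["reduce"]):
    # Hypothetical CLI convention: `script.py map` or `script.py reduce`.
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for out in step(sys.stdin):
        print(out)
```

Locally, piping the mapper output through `sort` before the reducer stands in for the shuffle.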
![Page 28: Recommendation and graph algorithms in Hadoop and SQL](https://reader033.fdocuments.us/reader033/viewer/2022052821/554a06f4b4c905557a8b55cd/html5/thumbnails/28.jpg)
mrjob from
a wrapper around hadoop streaming for map and reduce functions in python
from mrjob.job import MRJob

class MRWordFreqCount(MRJob):
    def mapper(self, _, line):
        for word in line.split():
            yield (word.lower(), 1)

    def reducer(self, word, counts):
        yield (word, sum(counts))

if __name__ == '__main__':
    MRWordFreqCount.run()
![Page 29: Recommendation and graph algorithms in Hadoop and SQL](https://reader033.fdocuments.us/reader033/viewer/2022052821/554a06f4b4c905557a8b55cd/html5/thumbnails/29.jpg)
Connected components in SQL and Hadoop
![Page 30: Recommendation and graph algorithms in Hadoop and SQL](https://reader033.fdocuments.us/reader033/viewer/2022052821/554a06f4b4c905557a8b55cd/html5/thumbnails/30.jpg)
Connected components
3 “components” in this graph.
How can we find them algorithmically … on a huge network?
![Page 31: Recommendation and graph algorithms in Hadoop and SQL](https://reader033.fdocuments.us/reader033/viewer/2022052821/554a06f4b4c905557a8b55cd/html5/thumbnails/31.jpg)
Connected components
Algorithm
Assign each node a random component id.
For each node, take the minimum component id of itself and all neighbors.
Repeat until no component id changes.
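A minimal in-memory sketch of this algorithm. Two assumptions go beyond the slide: the random ids are drawn without repetition so that two different components can never end up with the same id, and edges are treated as undirected. The function name and toy graph are hypothetical.

```python
import random

def connected_components(nodes, edges, seed=0):
    """Label each node with the smallest id found in its component."""
    rng = random.Random(seed)
    # Assign each node a distinct random component id.
    ids = rng.sample(range(len(nodes)), len(nodes))
    comp = dict(zip(nodes, ids))
    # Treat edges as undirected: collect neighbors in both directions.
    neighbors = {node: set() for node in nodes}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    changed = True
    while changed:  # repeat until no id changes
        changed = False
        new_comp = {}
        for node in nodes:
            best = min([comp[node]] + [comp[n] for n in neighbors[node]])
            new_comp[node] = best
            changed = changed or best != comp[node]
        comp = new_comp
    return comp

nodes = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("b", "c"), ("d", "e")]
comp = connected_components(nodes, edges)
```

Each pass spreads the minimum id one hop further, so the number of passes grows with the diameter of the largest component.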
![Page 32: Recommendation and graph algorithms in Hadoop and SQL](https://reader033.fdocuments.us/reader033/viewer/2022052821/554a06f4b4c905557a8b55cd/html5/thumbnails/32.jpg)
DEMO
![Page 33: Recommendation and graph algorithms in Hadoop and SQL](https://reader033.fdocuments.us/reader033/viewer/2022052821/554a06f4b4c905557a8b55cd/html5/thumbnails/33.jpg)
Computing Connected Components in SQL
Graph
Edges : id | head | tail
“Vector”
v : id | comp, initialized to random component
CREATE TABLE v2 AS (
  SELECT
    e.tail AS id,
    MIN(v.comp) AS comp
  FROM edges e
  INNER JOIN v
    ON e.head = v.id
  GROUP BY e.tail
);

DROP TABLE v;
ALTER TABLE v2 RENAME TO v;

... Repeat ...
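The repeat loop can be driven from Python with the standard `sqlite3` module. This is a sketch under stated assumptions: SQLite stands in for whatever database the demo used, the edges table omits the id column, and both edge directions plus one self-loop per node are stored so that the MIN covers the node itself and all its neighbors. The function name and toy graph are hypothetical.

```python
import sqlite3

def components_sql(edges, comp, max_rounds=100):
    """Iterate the CREATE TABLE v2 / DROP / RENAME loop in SQLite."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE edges (head TEXT, tail TEXT)")
    con.execute("CREATE TABLE v (id TEXT, comp INTEGER)")
    # Assumption: store both directions plus a self-loop per node, so the
    # MIN over incoming edges covers the node itself and all neighbors.
    rows = [(n, n) for n in comp]
    rows += [(a, b) for a, b in edges] + [(b, a) for a, b in edges]
    con.executemany("INSERT INTO edges VALUES (?, ?)", rows)
    con.executemany("INSERT INTO v VALUES (?, ?)", list(comp.items()))
    for _ in range(max_rounds):
        before = dict(con.execute("SELECT id, comp FROM v"))
        con.execute(
            "CREATE TABLE v2 AS SELECT e.tail AS id, MIN(v.comp) AS comp "
            "FROM edges e INNER JOIN v ON e.head = v.id GROUP BY e.tail")
        con.execute("DROP TABLE v")
        con.execute("ALTER TABLE v2 RENAME TO v")
        after = dict(con.execute("SELECT id, comp FROM v"))
        if after == before:  # no label changed: converged
            break
    con.close()
    return after

labels = components_sql([("a", "b"), ("b", "c"), ("d", "e")],
                        {"a": 5, "b": 3, "c": 9, "d": 7, "e": 8, "f": 1})
```

Without the self-loops, a node with no incoming edge would drop out of v after the first round, because the inner join has nothing to match.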
![Page 34: Recommendation and graph algorithms in Hadoop and SQL](https://reader033.fdocuments.us/reader033/viewer/2022052821/554a06f4b4c905557a8b55cd/html5/thumbnails/34.jpg)
Matrix-vector product and connected components in Hadoop
$Ax = y$, $y_i = \sum_k A_{ik} x_k$
Google’s PageRank, word count, average rating!
See example! matrix-hadoop/codes/smatvec.py
“$A^T x = y$”, $y_i = \min(x_i, \min_k A_{ki} x_k)$
Connected components
![Page 35: Recommendation and graph algorithms in Hadoop and SQL](https://reader033.fdocuments.us/reader033/viewer/2022052821/554a06f4b4c905557a8b55cd/html5/thumbnails/35.jpg)
Matrix-vector product
$Ax = y$, $y_i = \sum_k A_{ik} x_k$
A is stored by “node”
$ head samples/smat_5_5.txt
0 0 0.125 3 1.024 4 0.121
1 0 0.597
2 2 1.247
3 4 -1.45
4 2 0.061
v initially random
$ head samples/vec_5.txt
0 0.241
1 -0.98
2 0.237
3 -0.32
4 0.080
Follow along! matrix-hadoop/codes/smatvec.py
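As a single-machine reference for what the Hadoop job should produce, here is a sketch that parses the two sample formats and computes $y = Ax$ directly. It assumes each matrix line is a row index followed by (column, value) pairs and each vector line is an index and a value; the helper names are hypothetical and this is not the code in smatvec.py.

```python
def parse_matrix(lines):
    """Parse 'row col val col val ...' lines into {row: [(col, val), ...]}."""
    rows = {}
    for line in lines:
        vals = line.split()
        rows[int(vals[0])] = [(int(vals[i]), float(vals[i + 1]))
                              for i in range(1, len(vals), 2)]
    return rows

def parse_vector(lines):
    """Parse 'index value' lines into {index: value}."""
    return {int(i): float(v) for i, v in (line.split() for line in lines)}

def matvec(rows, x):
    """y_i = sum_k A_ik x_k, touching only the stored entries of A."""
    return {i: sum(a * x[k] for k, a in entries) for i, entries in rows.items()}

A = parse_matrix(["0 0 0.125 3 1.024 4 0.121", "1 0 0.597", "2 2 1.247",
                  "3 4 -1.45", "4 2 0.061"])
x = parse_vector(["0 0.241", "1 -0.98", "2 0.237", "3 -0.32", "4 0.080"])
y = matvec(A, x)
```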
![Page 36: Recommendation and graph algorithms in Hadoop and SQL](https://reader033.fdocuments.us/reader033/viewer/2022052821/554a06f4b4c905557a8b55cd/html5/thumbnails/36.jpg)
Matrix-vector product (in pictures)
$Ax = y$, $y_i = \sum_k A_{ik} x_k$
Input: A and x
Map 1: align on columns
Reduce 1: output $A_{ik} x_k$ keyed on row i
Reduce 2: output sum($A_{ik} x_k$), giving y
![Page 37: Recommendation and graph algorithms in Hadoop and SQL](https://reader033.fdocuments.us/reader033/viewer/2022052821/554a06f4b4c905557a8b55cd/html5/thumbnails/37.jpg)
Matrix-vector product (in pictures)
$Ax = y$, $y_i = \sum_k A_{ik} x_k$
Input: A and x
Map 1: align on columns
def joinmap(self, key, line):
    vals = line.split()
    if len(vals) == 2:
        # the vector
        yield (vals[0],              # row
               (float(vals[1]),))    # xi
    else:
        # the matrix
        row = vals[0]
        for i in range(1, len(vals), 2):
            yield (vals[i],                    # column
                   (row, float(vals[i+1])))    # i, Aij
![Page 38: Recommendation and graph algorithms in Hadoop and SQL](https://reader033.fdocuments.us/reader033/viewer/2022052821/554a06f4b4c905557a8b55cd/html5/thumbnails/38.jpg)
“Matrix-vector” for connected components
“$A^T x = y$”, $y_i = \min(x_i, \min_k A_{ki} x_k)$
Input: A and x
Map 1: align on columns
def joinmap(self, key, line):
    vals = line.split()
    if len(vals) == 2:
        # the vector
        yield (vals[0],              # row
               (float(vals[1]),))    # vi
    else:
        # the matrix
        row = vals[0]
        for i in range(1, len(vals), 2):
            # the edge value is carried along (an adjustment to the slide)
            # so the reducer can tell matrix entries (length 2) from
            # vector entries (length 1)
            yield (row,                           # head
                   (vals[i], float(vals[i+1])))   # tail, Aij
![Page 39: Recommendation and graph algorithms in Hadoop and SQL](https://reader033.fdocuments.us/reader033/viewer/2022052821/554a06f4b4c905557a8b55cd/html5/thumbnails/39.jpg)
Matrix-vector product (in pictures)
$Ax = y$, $y_i = \sum_k A_{ik} x_k$
Input: A and x
Map 1: align on columns
Reduce 1: output $A_{ik} x_k$ keyed on row i
def joinred(self, key, vals):
    vecval = 0.
    matvals = []
    for val in vals:
        if len(val) == 1:
            vecval += val[0]
        else:
            matvals.append(val)
    for val in matvals:
        yield (val[0], val[1]*vecval)
Note that you should use a secondary sort to avoid reading both in memory
![Page 40: Recommendation and graph algorithms in Hadoop and SQL](https://reader033.fdocuments.us/reader033/viewer/2022052821/554a06f4b4c905557a8b55cd/html5/thumbnails/40.jpg)
“Matrix-vector” for connected components
“$A^T x = y$”, $y_i = \min(x_i, \min_k A_{ki} x_k)$
Input: A and x
Map 1: align on columns
Reduce 1: output $A_{ik} x_k$ keyed on row i
def joinred(self, key, vals):
    vecval = 0.
    matvals = []
    for val in vals:
        if len(val) == 1:
            vecval += val[0]
        else:
            matvals.append(val)
    for val in matvals:
        yield (val[0], vecval)
Note that you should use a secondary sort to avoid reading both in memory
![Page 41: Recommendation and graph algorithms in Hadoop and SQL](https://reader033.fdocuments.us/reader033/viewer/2022052821/554a06f4b4c905557a8b55cd/html5/thumbnails/41.jpg)
Matrix-vector product (in pictures)
$Ax = y$, $y_i = \sum_k A_{ik} x_k$
Input: A and x
Map 1: align on columns
Reduce 1: output $A_{ik} x_k$ keyed on row i
Reduce 2: output sum($A_{ik} x_k$), giving y
def sumred(self, key, vals):
    yield (key, sum(vals))
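The three pieces (joinmap, joinred, sumred) can be chained in memory to check the logic before running anything on a cluster. This is a sketch that mirrors the slide code as plain functions with a hand-written shuffle; the `shuffle` and `matvec_mapreduce` helpers are hypothetical and no mrjob or Hadoop is involved.

```python
from collections import defaultdict

def shuffle(pairs):
    """Group values by key, as the MapReduce shuffle does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def joinmap(line):
    vals = line.split()
    if len(vals) == 2:                       # vector entry: key on its index
        yield vals[0], (float(vals[1]),)
    else:                                    # matrix row: key each entry on its column
        for i in range(1, len(vals), 2):
            yield vals[i], (vals[0], float(vals[i + 1]))

def joinred(key, vals):
    vecval = sum(v[0] for v in vals if len(v) == 1)
    for row, a in (v for v in vals if len(v) == 2):
        yield row, a * vecval                # A_ik * x_k keyed on row i

def sumred(key, vals):
    yield key, sum(vals)

def matvec_mapreduce(lines):
    stage1 = shuffle(kv for line in lines for kv in joinmap(line))
    products = [kv for k, vs in stage1.items() for kv in joinred(k, vs)]
    stage2 = shuffle(products)
    return dict(kv for k, vs in stage2.items() for kv in sumred(k, vs))

lines = ["0 0 0.125 3 1.024 4 0.121", "1 0 0.597", "2 2 1.247",
         "3 4 -1.45", "4 2 0.061",
         "0 0.241", "1 -0.98", "2 0.237", "3 -0.32", "4 0.080"]
y = matvec_mapreduce(lines)
```

Keys stay strings here because they come straight from the text lines, which matches the later remark that integer keys are not required.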
![Page 42: Recommendation and graph algorithms in Hadoop and SQL](https://reader033.fdocuments.us/reader033/viewer/2022052821/554a06f4b4c905557a8b55cd/html5/thumbnails/42.jpg)
Our social recommender
$R^T$, $S$
Follow along! matrix-hadoop/recsys/recsys.py
R is stored entry-wise
$ gunzip -c data/rating.txt.gz
139431556 591156 5
139431556 1312460676 5
139431556 204358 4
139431556 368725 5
Columns: Object ID, User ID, Rating
S is stored entry-wise
$ gunzip -c data/rating.txt.gz
3287060356 232085 -1
3288305540 709420 1
3290337156 204418 -1
3294138244 269243 -1
Columns: My ID, Other ID, Trust
![Page 43: Recommendation and graph algorithms in Hadoop and SQL](https://reader033.fdocuments.us/reader033/viewer/2022052821/554a06f4b4c905557a8b55cd/html5/thumbnails/43.jpg)
Matrix-matrix product
$AB = C$, $C_{ij} = \sum_k A_{ik} B_{kj}$
Follow along! matrix-hadoop/codes/matmat.py
Conceptually, the first step is the same as the matrix-vector product with a block of vectors.
![Page 44: Recommendation and graph algorithms in Hadoop and SQL](https://reader033.fdocuments.us/reader033/viewer/2022052821/554a06f4b4c905557a8b55cd/html5/thumbnails/44.jpg)
Matrix-matrix product (in pictures)
$AB = C$, $C_{ij} = \sum_k A_{ik} B_{kj}$
Map 1: align on columns
Reduce 1: output $A_{ik} B_{kj}$ keyed on (i,j)
Reduce 2: output sum($A_{ik} B_{kj}$), giving C
![Page 45: Recommendation and graph algorithms in Hadoop and SQL](https://reader033.fdocuments.us/reader033/viewer/2022052821/554a06f4b4c905557a8b55cd/html5/thumbnails/45.jpg)
Social recommender (in code)
Map 1: align on columns
def joinmap(self, key, line):
    parts = line.split('\t')
    if len(parts) == 8:  # ratings
        objid = parts[0].strip()
        uid = parts[1].strip()
        rat = int(parts[2])
        yield (uid, (objid, rat))
    elif len(parts) == 4:  # trust
        myid = parts[0].strip()
        otherid = parts[1].strip()
        value = int(parts[2])
        if value > 0:
            yield (otherid, (myid,))
![Page 46: Recommendation and graph algorithms in Hadoop and SQL](https://reader033.fdocuments.us/reader033/viewer/2022052821/554a06f4b4c905557a8b55cd/html5/thumbnails/46.jpg)
Matrix-matrix product (in pictures)
Map 1: align on columns
Reduce 1: output $A_{ik} B_{kj}$ keyed on (i,j)
def joinred(self, key, vals):
    tusers = []   # uids that trust key
    ratobjs = []  # objs rated by uid=key
    for val in vals:
        if len(val) == 1:
            tusers.append(val[0])
        else:
            ratobjs.append(val)

    for (objid, rat) in ratobjs:
        for uid in tusers:
            yield ((uid, objid), rat)
Conceptually, the second step is the same as the matrix-matrix product too: we “map” the ratings from each trusted user back to the source.
![Page 47: Recommendation and graph algorithms in Hadoop and SQL](https://reader033.fdocuments.us/reader033/viewer/2022052821/554a06f4b4c905557a8b55cd/html5/thumbnails/47.jpg)
Matrix-matrix product (in pictures)
$AB = C$, $C_{ij} = \sum_k A_{ik} B_{kj}$
Map 1: align on columns
Reduce 1: output $A_{ik} B_{kj}$ keyed on (i,j)
Reduce 2: output sum($A_{ik} B_{kj}$), giving C
def avgred(self, key, vals):
    s = 0.
    n = 0
    for val in vals:
        s += val
        n += 1
    # the smoothed average of ratings
    yield key, (s + self.options.avg) / float(n + 1)
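Putting the join and the smoothed average together on toy data shows what the recommender computes end to end. This is an in-memory sketch, not recsys.py: the data is made up, the tuples follow the (Object ID, User ID, Rating) and (My ID, Other ID, Trust) layouts, and `prior` plays the role of `self.options.avg`, whose actual value is not given in the slides.

```python
from collections import defaultdict

def trusted_ratings(ratings, trust):
    """Join ratings and trust on the rater: emit ((uid, objid), rating)."""
    by_rater = defaultdict(lambda: ([], []))   # rater -> (trusting users, rated objects)
    for myid, otherid, value in trust:
        if value > 0:                          # keep positive trust only
            by_rater[otherid][0].append(myid)
    for objid, uid, rat in ratings:
        by_rater[uid][1].append((objid, rat))
    for tusers, ratobjs in by_rater.values():
        for objid, rat in ratobjs:
            for uid in tusers:
                yield (uid, objid), rat

def smoothed_averages(pairs, prior):
    """Average per (uid, objid) key, smoothed by one pseudo-rating `prior`."""
    sums, counts = defaultdict(float), defaultdict(int)
    for key, rat in pairs:
        sums[key] += rat
        counts[key] += 1
    return {key: (sums[key] + prior) / (counts[key] + 1) for key in sums}

# Hypothetical toy data in the (objid, uid, rating) and (myid, otherid, trust) layouts.
ratings = [("o1", "u2", 5), ("o1", "u3", 3), ("o2", "u3", 4)]
trust = [("u1", "u2", 1), ("u1", "u3", 1), ("u4", "u3", -1)]
X = smoothed_averages(trusted_ratings(ratings, trust), prior=3.0)
```

The extra pseudo-rating pulls averages built from one or two ratings toward the prior, so a single 5 from one trusted user does not dominate the recommendations.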
![Page 48: Recommendation and graph algorithms in Hadoop and SQL](https://reader033.fdocuments.us/reader033/viewer/2022052821/554a06f4b4c905557a8b55cd/html5/thumbnails/48.jpg)
Better ways to store matrices in Hadoop
Block matrices minimize the number of intermediate keys and values used.
I’d form them based on the first reduce.
No need for “integer” keys that fall between 1 and n!
![Page 49: Recommendation and graph algorithms in Hadoop and SQL](https://reader033.fdocuments.us/reader033/viewer/2022052821/554a06f4b4c905557a8b55cd/html5/thumbnails/49.jpg)
Tall-and-Skinny matrices (m ≫ n)
Many rows (like a billion)
A few columns (under 10,000)
Used in
regression and general linear models with many samples
block iterative methods
panel factorizations
simulation data analysis
big-data SVD/PCA
[Figure: a tall-and-skinny matrix A, from the tinyimages collection]
![Page 50: Recommendation and graph algorithms in Hadoop and SQL](https://reader033.fdocuments.us/reader033/viewer/2022052821/554a06f4b4c905557a8b55cd/html5/thumbnails/50.jpg)
Questions?
Image from rockysprings, deviantart, CC share-alike