WOOster: A Map-Reduce based Platform for Graph Mining
Date posted: 18-Oct-2014
Aravindan Raghuveer
Yahoo! Inc, Bangalore.
Yahoo! Confidential
Introduction
“If you squint the right way, graphs are everywhere” [1]
@ Yahoo!:
• The WOO Graph: All knowledge assimilated from the web.
- http://iswc2011.semanticweb.org/fileadmin/iswc/Papers/Industry/WOO_ISWC.pptx
[1] http://googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html
The What and Why?
What?
• Family of Graph Query Algorithms.
• Framework:
  • For graph storage and invoking the query algorithms
  • Hosted Solution on Hadoop
Why?
• Family of Graph Query Algorithms: Present-day algorithms do not scale to billion-edge, billion-vertex graphs.
• Framework:
  • Optimizes storage layout to suit graph query algorithms
  • Improves throughput of the queries.
Outline of the talk
• MapReduce 101
• Graph Mining Approaches
• Brief overview of WOOster architecture
• Graph query algorithms in WOOster:
  • Sub Graph Matching
  • Reachability Query
• Experiments
• Conclusion
Map Reduce 101
Switch to slides from Cloud Computing with MapReduce and Hadoop
www.cs.berkeley.edu/~matei/talks/2009/parlab_bootcamp_clouds.ppt
MapReduce Programming Model
• Data type: key-value records
• Map function:
(K_in, V_in) → list(K_inter, V_inter)
• Reduce function:
(K_inter, list(V_inter)) → list(K_out, V_out)
Example: Word Count
def mapper(line):
    # Emit (word, 1) for every word in the input line.
    for word in line.split():
        yield word, 1

def reducer(key, values):
    # Sum the counts collected for one word.
    yield key, sum(values)
Word Count Execution
[Figure: word count data flow, Input → Map → Shuffle & Sort → Reduce → Output]
Input: "the quick brown fox" | "the fox ate the mouse" | "how now brown cow"
Map output: (the, 1), (quick, 1), (brown, 1), (fox, 1) | (the, 1), (fox, 1), (ate, 1), (the, 1), (mouse, 1) | (how, 1), (now, 1), (brown, 1), (cow, 1)
Reduce output: (brown, 2), (fox, 2), (how, 1), (now, 1), (the, 3) | (ate, 1), (cow, 1), (mouse, 1), (quick, 1)
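The data flow on this slide can be reproduced on one machine with a few lines of Python. The sketch below is only an in-memory stand-in for Hadoop: `run_mapreduce` is a hypothetical helper name, and the shuffle is simulated with a dictionary rather than a distributed sort.

```python
from collections import defaultdict

def mapper(line):
    # Map: emit (word, 1) for every word in the line.
    for word in line.split():
        yield word, 1

def reducer(key, values):
    # Reduce: sum all counts that share the same word.
    yield key, sum(values)

def run_mapreduce(records, mapper, reducer):
    # Shuffle & sort: group every intermediate value under its key.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce each key group, visiting keys in sorted order.
    output = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            output[out_key] = out_value
    return output

lines = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]
counts = run_mapreduce(lines, mapper, reducer)
print(counts)
```

The resulting counts agree with the Output column of the slide, for example `the` appears 3 times and `brown` twice.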
Graph Mining Approaches: Two Schools
School-1: Invent a new platform:
- Map-Reduce is not best suited for graph mining
- BSP, PRAM models: circa 1980s
- Pregel from Google [1], HaLoop
School-2: Ride on Map-Reduce
- MR has wide adoption, open source tools, industry support.
- Invest on one more computing infrastructure
- Apache Giraph: http://incubator.apache.org/giraph/ (BSP on Hadoop)
- Efforts in open source / academia on the same lines:
• Pegasus, CMU [2]
• Graph Mining in Apache Mahout [3]
• Raytheon’s Graph Mining [4]
[1] SIGMOD 2010, http://dl.acm.org/citation.cfm?id=1807184
[2] http://www.cs.cmu.edu/~pegasus/
[3] http://www.robust-project.eu/news/robust-project-pushes-large-scale-graph-mining-with-hadoop-apache
[4] http://www.cloudera.com/blog/2010/03/how-raytheon-researchers-are-using-hadoop-to-build-a-scalable-distributed-triple-store/
WOOster Architecture
• User submits a query.
• Planner periodically scans for newly arrived queries.
• Planner creates an M-R plan that re-uses computation / IO across queries. (Batching)
• Executor executes the M-R plan.
• Result notified to the user (Hosted Solution)
[Figure: architecture diagram with components WOOster Web UI & WebService APIs, Planner, Executor, Grid, Jobs D/B, Graph Indices, WOO Graph]
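The batching idea can be illustrated with a toy planner. Everything in this sketch is an assumption made for illustration: the `Query` and `Planner` classes, the rule of grouping queries by `kind`, and the use of a row count as a stand-in for real M-R work are not taken from the slides, which do not describe how WOOster builds its plans.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class Query:
    query_id: str
    kind: str  # hypothetical label, e.g. "subgraph_match" or "reachability"

@dataclass
class Planner:
    pending: list = field(default_factory=list)

    def submit(self, query):
        # A user submits a query; it waits until the next planner scan.
        self.pending.append(query)

    def scan(self):
        # Drain newly arrived queries and batch them by kind (assumed rule).
        batches = defaultdict(list)
        for query in self.pending:
            batches[query.kind].append(query.query_id)
        self.pending = []
        return dict(batches)

def execute(batches, graph_rows):
    # One pass over the graph per batch, shared by every query in the batch.
    results, passes = {}, 0
    for kind, query_ids in batches.items():
        passes += 1
        rows_read = len(graph_rows)  # stand-in for the real M-R job
        for query_id in query_ids:
            results[query_id] = (kind, rows_read)
    return results, passes

planner = Planner()
for q in [Query("u1", "subgraph_match"), Query("u2", "subgraph_match"),
          Query("u3", "reachability")]:
    planner.submit(q)
batches = planner.scan()
results, passes = execute(batches, graph_rows=["row-1", "row-2"])
print(batches, passes)
```

In this toy run, three queries need only two passes over the graph because the two sub-graph match queries share one.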
The Sub-Graph Match Query
Find all instances of query Q in graph G
• Vertices have attributes (ex: age:31)
• Edges have relationship labels.
• Vertices and edges have constraints (ex: age<40)
Notation: Query Vertex, Graph Vertex, A matched graph vertex
Why Sub-Graph Match (Exact Graph Isomorphism)?
• A popular and expressive graph query useful to mine patterns.
• To our knowledge, no large-scale algorithm that operates on a billion-vertex graph is available.
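To make the query semantics concrete, here is a brute-force reference matcher for a tiny graph. It is not the WOOster algorithm: it tries every assignment of graph vertices to query vertices, which is only feasible for toy inputs. The data, the directed-edge representation, and the function name `brute_force_matches` are assumptions for illustration.

```python
from itertools import permutations

# Toy data graph: vertex attributes and labeled (assumed directed) edges.
graph_vertices = {"g1": {"age": 31}, "g2": {"age": 45}, "g3": {"age": 25}}
graph_edges = {("g1", "g2"): "friend", ("g1", "g3"): "friend",
               ("g2", "g3"): "colleague"}

# Query: each vertex carries a constraint, each edge a required label.
query_vertices = {"q1": lambda attrs: attrs["age"] < 40,
                  "q2": lambda attrs: True}
query_edges = {("q1", "q2"): "friend"}

def brute_force_matches(graph_vertices, graph_edges, query_vertices, query_edges):
    matches = []
    q_names = sorted(query_vertices)
    # Try every injective assignment of graph vertices to query vertices.
    for combo in permutations(sorted(graph_vertices), len(q_names)):
        mapping = dict(zip(q_names, combo))
        # Every query vertex constraint must hold on its graph vertex.
        if not all(query_vertices[q](graph_vertices[g]) for q, g in mapping.items()):
            continue
        # Every query edge must exist in the graph with the same label.
        if all(graph_edges.get((mapping[a], mapping[b])) == label
               for (a, b), label in query_edges.items()):
            matches.append(mapping)
    return matches

matches = brute_force_matches(graph_vertices, graph_edges, query_vertices, query_edges)
print(matches)
```

The number of assignments grows factorially with graph size, which is one way to see why a partitioned, parallel approach is needed at scale.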
Overview of the Solution
Step-0. Data Layout on HDFS
Step-1. Query Graph Partitioning
Step-2. Edge Selection
Step-3. Query Partition Matching
Step-4. Query Partition Merging
Data Layout on HDFS
• How to store a large scale graph?
• Adjacency List like solution:
  • Each row/line has information about a vertex:
    • Vertex attributes
    • Vertex neighbors and the labels associated with each edge.
Implications:
• Enables early pruning of non-matching edges and vertices.
• Each vertex has information about itself and its immediate neighbors only.
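One possible encoding of such a row is sketched below. The slides do not give the on-disk format, so the tab-separated layout with JSON fields and the names `encode_row` / `decode_row` are purely illustrative assumptions.

```python
import json

def encode_row(vertex_id, attributes, neighbors):
    # One line per vertex: id, attributes, and neighbor -> edge label.
    return "\t".join([
        vertex_id,
        json.dumps(attributes, sort_keys=True),
        json.dumps(neighbors, sort_keys=True),
    ])

def decode_row(line):
    # A mapper can rebuild the vertex and its immediate neighborhood
    # from a single line, without looking at any other row.
    vertex_id, attributes, neighbors = line.rstrip("\n").split("\t")
    return vertex_id, json.loads(attributes), json.loads(neighbors)

row = encode_row("g1", {"age": 31}, {"g2": "friend", "g3": "colleague"})
decoded = decode_row(row)
print(row)
```

Because each line is self-contained, a mapper can check vertex and edge constraints locally and drop non-matching rows early, which is the pruning benefit listed above.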
Step-1: Query Graph Partitioning
Why?: Parallelized solving of independent sub-problems
How? Find the minimum number of partitions such that the diameter of each partition = 2.
Intuition:
• In a spanning tree of diameter 2, there is one vertex that is connected to all other vertices: the pivot vertex.
• Will use this property in steps 2, 3.
[Figure: query graph partitions with their Pivot Vertices]
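A simple way to obtain star-shaped partitions is a greedy heuristic: repeatedly pick the query vertex that covers the most remaining edges and make it a pivot. This is a sketch under assumptions, not the method from the paper; a greedy choice does not guarantee the minimum number of partitions, and the function name `partition_query` is hypothetical.

```python
def partition_query(edges):
    # Treat query edges as undirected pairs.
    remaining = {frozenset(e) for e in edges}
    partitions = []
    while remaining:
        # Count how many uncovered edges touch each vertex.
        degree = {}
        for edge in remaining:
            for vertex in edge:
                degree[vertex] = degree.get(vertex, 0) + 1
        # Pick the highest-degree vertex as pivot (ties: smallest name).
        pivot = max(sorted(degree), key=lambda v: degree[v])
        # The pivot and all its uncovered edges form one star partition.
        star = sorted(tuple(sorted(e)) for e in remaining if pivot in e)
        partitions.append((pivot, star))
        remaining = {e for e in remaining if pivot not in e}
    return partitions

query_edges = [("q1", "q2"), ("q1", "q3"), ("q1", "q4"), ("q4", "q5")]
partitions = partition_query(query_edges)
print(partitions)
```

Each partition is a star around its pivot, so every vertex in it is at most one hop from the pivot and two hops from any other vertex.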
Step-2: Edge Selection
• What: Select a subset of edges from G that match at least one edge in Q.
• How:
[Figure: map logic and reduce logic over graph vertices g1, g2, g3, g4]
1. g1: Current vertex in mapper.
2a. Mapper emits all edges if vertex and edge constraints are met.
2b. For every neighbor of q1, there exists a corresponding neighbor for g1.
3. g1-g2 emitted: g1 mapped to a query vertex.
4. g1-g2 emitted from g2’s mapper.
5. Reducer emits an edge if a pair is found.
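The map and reduce logic of this step might look roughly like the sketch below, run here in memory. The toy `GRAPH` and `QUERY` data, the undirected-edge assumption, and the label-only neighbor check are all simplifications introduced for illustration; the paper's actual algorithm may differ.

```python
from collections import defaultdict

# Hypothetical toy data: vertex -> (attributes, {neighbor: edge label}).
GRAPH = {
    "g1": ({"age": 31}, {"g2": "friend", "g3": "friend"}),
    "g2": ({"age": 45}, {"g1": "friend"}),
    "g3": ({"age": 35}, {"g1": "friend"}),
    "g4": ({"age": 29}, {}),
}
# Query vertex -> (constraint, {query neighbor: required label}).
QUERY = {
    "q1": (lambda attrs: attrs["age"] < 40, {"q2": "friend"}),
    "q2": (lambda attrs: attrs["age"] > 40, {"q1": "friend"}),
}

def candidates(attrs, nbrs):
    # Query vertices this graph vertex could map to, from local data only:
    # its constraint holds and every query edge label has a local counterpart.
    labels = set(nbrs.values())
    return [q for q, (ok, q_nbrs) in QUERY.items()
            if ok(attrs) and all(lbl in labels for lbl in q_nbrs.values())]

def mapper(g, attrs, nbrs):
    qs = candidates(attrs, nbrs)
    for n, label in nbrs.items():
        # Emit the edge only if some candidate query vertex uses this label.
        if any(label in QUERY[q][1].values() for q in qs):
            yield tuple(sorted((g, n))), (g, label, qs)

def reducer(edge, values):
    by_vertex = {g: (label, qs) for g, label, qs in values}
    if len(by_vertex) < 2:
        return  # only one endpoint emitted the edge: no pair found
    (_, (label, qs_a)), (_, (_, qs_b)) = sorted(by_vertex.items())
    # Keep the edge if the two endpoints can map to a real query edge.
    if any(QUERY[qa][1].get(qb) == label for qa in qs_a for qb in qs_b):
        yield edge

groups = defaultdict(list)
for g, (attrs, nbrs) in GRAPH.items():
    for key, value in mapper(g, attrs, nbrs):
        groups[key].append(value)
selected = sorted(e for key, vals in groups.items() for e in reducer(key, vals))
print(selected)
```

In this toy run the edge g1-g3 is emitted by both endpoints but dropped by the reducer, because both g1 and g3 can only map to q1 and the query has no q1-q1 edge.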
Step-3: Query Partition Matching
Edge Selection:
• Associates a graph vertex to the possible query vertices it could map to
• Associates the graph vertex to its “pivot” graph vertex.
• Pivot graph vertex is a graph vertex which is mapped to a pivot query vertex: g1 in this example
[Figure: edge selection output (edges g1-g2, g1-g3, g1-g4) passing through map logic and reduce logic to form the partition around g1]
1. Mapper emits pivot graph vertex as key and edge as value.
2. Reducer receives all edges with the same pivot graph vertex.
3. Reducer forms the partition.
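The three numbered steps amount to a group-by on the pivot graph vertex. The sketch below assumes that each record from edge selection already carries its pivot as a third field, and that a partition is simply the sorted list of edges around the pivot; both are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical edge selection output: (vertex, vertex, pivot graph vertex).
selected_edges = [
    ("g1", "g2", "g1"),
    ("g1", "g3", "g1"),
    ("g1", "g4", "g1"),
    ("g5", "g6", "g5"),
]

def mapper(record):
    # Step 1: key on the pivot graph vertex, carry the edge as the value.
    u, v, pivot = record
    yield pivot, (u, v)

def reducer(pivot, edges):
    # Steps 2 and 3: all edges of one pivot arrive together and form
    # the candidate partition around that pivot.
    yield pivot, sorted(edges)

groups = defaultdict(list)
for record in selected_edges:
    for key, value in mapper(record):
        groups[key].append(value)

partitions = {}
for pivot in sorted(groups):
    for key, partition in reducer(pivot, groups[pivot]):
        partitions[key] = partition
print(partitions)
```

A fuller version would also verify that the collected edges cover every edge of the corresponding query partition before emitting it.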
Step-4: Query Partition Merging
• Merges partitions one after another to form a query match
• More details in paper.
Take-away from Steps 1-4: (also for any scalable Map-Reduce program)
The mapper/reducer keys are chosen such that the number of keys is proportional to the number of matches of query Q in the graph. Hence the algorithm scales well for large graphs and complex queries.
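Since the slides defer the merging details to the paper, the following is only a generic join sketch, not the authors' procedure. It assumes each partition match is a dictionary from query vertex to graph vertex, and that two matches can be merged when they agree on shared query vertices and never map two query vertices to the same graph vertex. The name `merge_partitions` is hypothetical.

```python
def merge_partitions(left, right):
    merged = []
    for a in left:
        for b in right:
            # Matches must agree on every query vertex they share.
            shared = a.keys() & b.keys()
            if any(a[q] != b[q] for q in shared):
                continue
            combined = {**a, **b}
            # Keep the mapping injective: one graph vertex per query vertex.
            if len(set(combined.values())) == len(combined):
                merged.append(combined)
    return merged

left = [{"q1": "g1", "q2": "g2"}, {"q1": "g1", "q2": "g3"}]
right = [{"q2": "g2", "q3": "g5"}, {"q2": "g3", "q3": "g1"}]
full_matches = merge_partitions(left, right)
print(full_matches)
```

Applying the function repeatedly, one partition at a time, mirrors the "one after another" merging described on the slide.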
Results
• Graph of 10 million vertices and 50 million edges
• Complex Query of 24 vertices
• Note that the edge selection time reduces with increasing number of reducers.
[Figure: Time (sec), 0 to 160, versus Number of Reducers (100, 150, 200, 250), with series Edge Selection, Query Partition Matching, Query Partition Merging]
In the paper…
• Detailed map-reduce algorithms for sub-graph match and reachability
• Theoretical analysis for scalability
• Construction of the synthetic dataset
• Methodology and more experiments.
• Reachability query: examples, map-reduce algorithm
• Related work
Future Work
• Indexing structure for graphs suited for M-R jobs
• Compare with Giraph based approach.
• Better batching strategies.
• Right interface for custom graph algorithms to be plugged in while WOOster provides automatic batching.
• More graph mining algorithms implemented
Questions / Comments