Graph Data Mining with Map-Reduce Nima Sarshar, Ph.D. INTUIT Inc, [email protected].
Intuit, Graphs and Me
Me: Large-scale graph data processing,
complex networks analysis, graph algorithms …
Intuit: QuickBooks, TurboTax, Mint.com,
GoPayment, …
Graphs @ Intuit: Commercial Graph is the business
“social network”
My Goals for this Talk
You leave with your inner computer scientist tantalized: There is more to writing efficient Map-Reduce algorithms
than counting words and merging logs
You get a general sense of the state of the research
I convince you of the need for a real graph processing package for Hadoop
You know a bit about our work at Intuit
Plan
Jump right to it with an example (enumerating triangles)
Define the performance metrics (what are we optimizing for?)
Give a classification of known “recipes”
The triangle example with a new trick
Personalized PageRank, connected components
A list of other algorithms
Finding Triangles with Map-Reduce
[Figure: example graph on nodes 1, 2, 3, 4 with edges 1-3, 2-3, 2-4, 3-4]
Step 1: Key edges by both end nodes
Step 2: Emit potential triangles
5 Potential Triangles to Consider
Another round of Map-Reduce jobs will check for the existence of the “closing” edge
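The two steps can be sketched on a single machine as follows. This is only an illustration, not the talk's implementation: the function name `naive_triangles` is hypothetical, and the edge list (1-3, 2-3, 2-4, 3-4) is an assumption read off the slide's example figure.

```python
from collections import defaultdict
from itertools import combinations

def naive_triangles(edges):
    """Simulate the naive two-round scheme in memory (hypothetical helper)."""
    edge_set = {frozenset(e) for e in edges}
    # Step 1 (map + shuffle): key every edge by both of its end nodes.
    bins = defaultdict(set)
    for u, v in edges:
        bins[u].add(v)
        bins[v].add(u)
    # Step 2 (reduce): every pair of neighbours of a node is a potential triangle.
    potential = [(node, a, b)
                 for node, nbrs in bins.items()
                 for a, b in combinations(sorted(nbrs), 2)]
    # Second round: keep a potential triangle only if its closing edge exists.
    found = [(node, a, b) for node, a, b in potential
             if frozenset((a, b)) in edge_set]
    return potential, found

# Assumed example graph from the slide.
potential, found = naive_triangles([(1, 3), (2, 3), (2, 4), (3, 4)])
print(len(potential), "potential triangles,", len(found), "confirmed records")
```

On this graph the single triangle {2, 3, 4} is reported once under each of its three nodes.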
Problems with this Approach
1. Each triangle will be detected 3 times – once under each of its 3 vertices
2. Too many “potential” triangles are created in the first reduce step.
For a node with degree d: $\binom{d}{2} = d(d-1)/2$ potential triangles
Total # of records: $\sum_v \binom{d_v}{2} = O\left(\sum_v d_v^2\right)$
Modified Algorithm [Cohen ‘08]
[Figure: the same example graph (edges 1-3, 2-3, 2-4, 3-4), with each edge binned only under its smaller end node]
Step 1: Only under smaller node
Step 2: Emit potential triangles
For each triangle exactly one potential triangle is created (under
the lowest value node)
The quadratic problem still persists
This is neat. At least we are not triple counting
But the quadratic problem still exists. The number of records is still $O(N\langle k^2\rangle)$, where $\langle k^2\rangle$ is the mean squared degree
We want to avoid binning edges under high degree nodes
The ordering of nodes is arbitrary! Let the degree of a node define its order.
Bin an edge under its LOW DEGREE node
Break ties arbitrarily, but consistently
[Figure: example graph on nodes 1 to 5, with each edge binned under its lower-degree end node]
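A minimal sketch of binning each edge under only one of its end nodes, so that the ID ordering and the degree ordering can be compared. The function name `ordered_triangles`, the `by_degree` flag and the small hub graph are hypothetical and chosen only for illustration; ties in degree are broken by node ID, which is one consistent choice among many.

```python
from collections import defaultdict
from itertools import combinations

def ordered_triangles(edges, by_degree=True):
    """Bin each edge under its lower-ranked end node; return (records, triangles)."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    def rank(n):
        # Degree ordering with ties broken by ID, or plain ID ordering.
        return (degree[n], n) if by_degree else (0, n)

    edge_set = {frozenset(e) for e in edges}
    bins = defaultdict(set)
    for u, v in edges:
        low, high = sorted((u, v), key=rank)
        bins[low].add(high)

    records = 0          # number of potential triangles emitted
    triangles = set()
    for node, nbrs in bins.items():
        for a, b in combinations(nbrs, 2):
            records += 1
            if frozenset((a, b)) in edge_set:
                triangles.add(frozenset((node, a, b)))
    return records, triangles

# Hypothetical graph: hub 0 joined to nodes 1..6, plus the edge 1-2.
hub_graph = [(0, i) for i in range(1, 7)] + [(1, 2)]
print(ordered_triangles(hub_graph, by_degree=False))  # ID order: the hub collects every edge
print(ordered_triangles(hub_graph, by_degree=True))   # degree order: the hub collects none
```

Both orderings report each triangle exactly once; only the number of potential-triangle records differs.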
The performance
Worst case: […] records vs. […]. The same as the best serial algorithm [Suri ‘11]
The gain for “real” graphs is fairly substantial. If a graph is reasonably random, it cuts down to: […] vs. […]
For a heavy-tailed social graph (like our Commercial Graph), the gain can be very large
Enumerating Rectangles
Triangles will tell you the friends you have in common with another friend
“People you May Know”: Find another node, not connected to you, who has many friends in common with you. That node is a good candidate for “friendship”.
Basis of user-based or content-based collaborative filtering if the graph is bipartite
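The “People you May Know” idea can be sketched as counting common friends for pairs of nodes that are not yet connected. A pair with two or more common friends closes at least one rectangle. The function name `common_friend_counts` and the sample edges are hypothetical; a real Map-Reduce job would emit the pairs per middle node and sum them in a reducer.

```python
from collections import defaultdict
from itertools import combinations

def common_friend_counts(edges):
    """Map each unconnected pair (a, b), a < b, to its number of common friends."""
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    counts = defaultdict(int)
    # Every node "in the middle" contributes one common friend to each
    # pair of its neighbours that is not already connected.
    for middle, friends in nbrs.items():
        for a, b in combinations(sorted(friends), 2):
            if b not in nbrs[a]:
                counts[(a, b)] += 1
    return dict(counts)

# Hypothetical graph: 1 and 4 are not connected but share friends 2 and 3.
counts = common_friend_counts([(1, 2), (1, 3), (4, 2), (4, 3), (1, 5)])
best = max(counts.values())
print([pair for pair, c in counts.items() if c == best])
```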
Generalization to Rectangles
There are 4 classes for a rectangle: requires a bit more work
[Figure: three rectangles with node labels 1 to 4 placed in different orders, and a triangle with nodes A, B, C]
Ordering the nodes of a triangle gives a single equivalence class
Performance Metrics
Computation: Total computation in all mappers and reducers
Communication: How many bits are shuffled from the mappers to the reducers
Number of Map-Reduce steps: The overhead of running jobs. You can work it into the above
“Recipes” for Graph MR Algorithms
Roughly two classes of algorithms:
1. Partition-Compute then Merge
  Create smaller sub-graphs that fit into a single machine's memory
  Do computation on the small graphs
  Construct the final answer from the answers to the small sub-problems
2. Compute-in-Parallel then Merge
Partition-Compute-Merge
Finding Triangles By Partitioning [Suri ‘11]
1. Partition the nodes into b sets: $V_1, \dots, V_b$
2. For every 3 sets $V_i, V_j, V_k$ create a reducer.
3. Send an edge to the reducer for $(V_i, V_j, V_k)$ iff both its ends are in $V_i \cup V_j \cup V_k$
4. Detect triangles using a serial algorithm within each reducer
Example: b=4, V1={1}, V2={2}, V3={3}, V4={4}
[Figure: the example graph with edges 1-3, 2-3, 2-4, 3-4 and the edges received by reducers V1,2,3, V1,3,4 and V2,3,4]
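The partitioning steps can be sketched in memory as below. This is an illustration under assumptions, not the code from [Suri ‘11]: the names `serial_triangles`, `partition_triangles` and `part_of` are hypothetical, partitions are numbered from 0, and the serial finder is a simple neighbour-intersection routine rather than an optimal one.

```python
from itertools import combinations

def serial_triangles(edges):
    """Simple serial triangle finder used inside each reducer."""
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    return {frozenset((u, v, w)) for u, v in edges for w in nbrs[u] & nbrs[v]}

def partition_triangles(edges, part_of, b):
    """part_of maps each node to a partition index in range(b)."""
    # One reducer per unordered triple of partitions.
    reducers = {triple: [] for triple in combinations(range(b), 3)}
    # An edge goes to every reducer whose three partitions contain both ends.
    for u, v in edges:
        for triple, local_edges in reducers.items():
            if part_of[u] in triple and part_of[v] in triple:
                local_edges.append((u, v))
    found = set()
    for local_edges in reducers.values():
        found |= serial_triangles(local_edges)
    return found, reducers

# The slide's example: b=4 with one node per partition (indices shifted to start at 0).
edges = [(1, 3), (2, 3), (2, 4), (3, 4)]
found, reducers = partition_triangles(edges, {1: 0, 2: 1, 3: 2, 4: 3}, b=4)
print(found)
```

Collecting results in a set hides duplicate detections; a real job would have to de-duplicate or weight them explicitly.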
Analysis
Every triangle is detected. All 3 vertices are guaranteed to be together in at least one reducer
Average # edges in each reducer is $O(M/b^2)$
Use an optimal serial triangle finder at each reducer. The total amount of work at all reducers is: $O(M^{3/2})$
# of edges sent from the mappers to reducers (communication cost) is $O(bM)$
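A back-of-the-envelope derivation of these three quantities, as a sketch under stated assumptions: nodes are assigned to the b sets uniformly at random, there is one reducer per unordered triple of sets, each reducer's load is treated as its average, and the serial triangle finder runs in $O(m^{3/2})$ time on m edges.

```latex
\begin{align*}
\mathbb{E}[\text{edges per reducer}]
  &= M \cdot \left(\frac{3}{b}\right)^{2} = \frac{9M}{b^{2}} = O\!\left(\frac{M}{b^{2}}\right) \\
\text{total reducer work}
  &\approx \binom{b}{3} \cdot O\!\left(\left(\frac{M}{b^{2}}\right)^{3/2}\right)
   = O\!\left(b^{3} \cdot \frac{M^{3/2}}{b^{3}}\right) = O\!\left(M^{3/2}\right) \\
\text{communication}
  &= \binom{b}{3} \cdot \frac{9M}{b^{2}} = O(bM)
\end{align*}
```

The first line uses the fact that each end of an edge lands in one of a reducer's three sets with probability 3/b.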
One Problem
Each triangle may be detected multiple times. If all three vertices are mapped to the same partition, it will be detected $\binom{b-1}{2}$ times, once for every reducer whose 3 sets include that partition
This can be fixed with a similar ordering-of-nodes trick [Afrati ’12]
Can be generalized to detect other small graph structures efficiently [Afrati ‘12]
Minimum Weight Spanning Tree
1. Partition the nodes into b sets
2. For every pair of sets create a reducer
3. Send all edges that have both their ends in one pair to the corresponding reducer
4. Compute the minimum spanning tree for the graph in each reducer. Remove other edges to sparsify the graph
5. Compute the MST for the sparsified graph
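The five steps can be sketched in memory as follows. The function names `kruskal` and `partitioned_mst`, the `(weight, u, v)` edge format and the sample graph are assumptions made for illustration. The sparsification is safe because an edge dropped inside a sub-graph is the heaviest edge on some cycle there, so an MST of the whole graph exists without it (ties are broken consistently by tuple order).

```python
from itertools import combinations

def kruskal(nodes, edges):
    """Minimum spanning forest of (weight, u, v) edges via union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    forest = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            forest.append((w, u, v))
    return forest

def partitioned_mst(nodes, edges, part_of, b):
    """part_of maps each node to a partition index in range(b); requires b >= 2."""
    kept = set()
    # One "reducer" per pair of partitions: keep only its local forest edges.
    for i, j in combinations(range(b), 2):
        local_nodes = [n for n in nodes if part_of[n] in (i, j)]
        local_edges = [(w, u, v) for w, u, v in edges
                       if part_of[u] in (i, j) and part_of[v] in (i, j)]
        kept.update(kruskal(local_nodes, local_edges))
    # Final step: MST of the sparsified graph.
    return kruskal(nodes, kept), kept

# Hypothetical weighted graph on 4 nodes.
nodes = [1, 2, 3, 4]
edges = [(1, 1, 2), (2, 2, 3), (3, 3, 4), (4, 1, 4), (5, 1, 3), (6, 2, 4)]
mst, kept = partitioned_mst(nodes, edges, {1: 0, 2: 0, 3: 1, 4: 2}, b=3)
print(sum(w for w, _, _ in mst), len(kept), "of", len(edges), "edges kept")
```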
Compute-in-parallel and merge
Personalized PageRank
Like the global PageRank, but with a random walker that comes back to where it started with probability d
For every v you will have a personalized PageRank vector of length N. We usually keep only a limited number of top personalized PageRanks for each node.
It finds the influential nodes in the proximity of a given node.
Monte Carlo Approximation
Simulate many random walks from every single node. For each walk:
1. A walk starting from node v is identified by v
  Keep track of <v, U_{v,t}> where U_{v,t} is the current end point at step t for the walk starting at node v
2. In each Map-Reduce step advance the walk by 1 step
  Pick a random neighbor of U_{v,t}
3. Count the frequency of visits to each node
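A single-machine sketch of the three steps, where each loop iteration stands in for one Map-Reduce step. The function name `monte_carlo_ppr`, its parameters and the default values are hypothetical; `reset_prob` plays the role of the return probability d, and the normalisation by the total number of visits is one simple choice of estimator.

```python
import random
from collections import Counter, defaultdict

def monte_carlo_ppr(edges, walks_per_node=200, steps=10, reset_prob=0.15, seed=0):
    """Estimate personalized PageRank vectors by simulating random walks."""
    rng = random.Random(seed)
    nbrs = defaultdict(list)
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)
    # Step 1: each walk is the pair (start node v, current end point U_{v,t}).
    walks = [(v, v) for v in list(nbrs) for _ in range(walks_per_node)]
    visits = defaultdict(Counter)
    for _ in range(steps):
        next_walks = []
        for start, current in walks:
            # Step 2: jump back to the start with probability reset_prob,
            # otherwise move to a random neighbour of the current end point.
            if rng.random() < reset_prob:
                current = start
            else:
                current = rng.choice(nbrs[current])
            # Step 3: count the visit.
            visits[start][current] += 1
            next_walks.append((start, current))
        walks = next_walks
    total = walks_per_node * steps
    return {v: {u: c / total for u, c in cnt.items()} for v, cnt in visits.items()}

# Hypothetical path graph 1-2-3-4-5; keep the top 3 entries for node 1.
ppr = monte_carlo_ppr([(1, 2), (2, 3), (3, 4), (4, 5)])
print(sorted(ppr[1].items(), key=lambda kv: -kv[1])[:3])
```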
One can do better [Das Sarma ‘08]
This takes T steps for a walk of length T
We can cut it down to $T^{1/2}$ by a simple “stitching” idea
1. Do T/J random walks of length J from every node for some J
2. To form a walk of length T, pick one of the T/J segments at random and jump to the end of the segment
3. Pick another random segment, etc.
4. If you arrive at a node twice, do not use the same segment (that’s why you need T/J segments)
Total iterations: $J + T/J$, minimized when $J = T^{1/2}$, giving $O(T^{1/2})$
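The choice of J follows from a one-line minimization, sketched here treating J as a continuous variable: J iterations build the segments and T/J iterations stitch them.

```latex
\begin{align*}
f(J) &= J + \frac{T}{J}, \qquad
f'(J) = 1 - \frac{T}{J^{2}} = 0 \;\Longrightarrow\; J = T^{1/2} \\
f\!\left(T^{1/2}\right) &= T^{1/2} + \frac{T}{T^{1/2}} = 2\,T^{1/2} = O\!\left(T^{1/2}\right)
\end{align*}
```

For example, a walk of length T = 100 needs about 10 + 10 = 20 iterations with J = 10, instead of 100.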
Exponential speed up [Bahmani ‘11]
The stitching was done somewhat serially (at each step, one segment was stitched to another)
Idea: Stitch recursively, which results in exponentially expanding the walk/segment ratio
It takes a few more tricks to make it work, but you can bring it down to O(log T)
Labeling Connected Components
Assign the same ID to all nodes inside the same component
[Figure: example graph on nodes 1 to 6]
How do we do it on one machine?
1. i=1
2. Pick a random node you have not picked before, assign it id=i and put it in a stack
3. Pop a node from the stack, push all its neighbors we have not seen before onto the stack. Assign them id=i
4. If stack is not empty go to 3, otherwise i ← i+1 and go to 2
Time and memory complexity O(M).
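The serial procedure above can be written as a short stack-based traversal. The function name `label_components_serial` and the sample edges are hypothetical. Seeds are taken in the given node order rather than at random, which does not change the resulting components.

```python
from collections import defaultdict

def label_components_serial(nodes, edges):
    """Assign the same integer id to all nodes of the same component."""
    nbrs = defaultdict(list)
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)
    component = {}
    i = 1
    for seed in nodes:
        if seed in component:
            continue                    # already reached from an earlier seed
        component[seed] = i
        stack = [seed]
        while stack:
            node = stack.pop()
            for n in nbrs[node]:
                if n not in component:  # unseen neighbour: label it and explore later
                    component[n] = i
                    stack.append(n)
        i += 1
    return component

# Hypothetical six-node graph with two components.
print(label_components_serial(range(1, 7), [(1, 2), (1, 3), (3, 4), (5, 6)]))
```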
In Map-Reduce: More Parallelism
Instead of growing a frontier zone from a single seed, start growing it from all nodes. When two zones meet, merge them
[Figure: example graph on nodes 1 to 4]

Edge File
<v1,v2>
<v2,v3>
<v3,v4>
Zone File
<v1,z1>
<v2,z2>
<v3,z3>
<v4,z4>

Game Plan
1. Bin Zone and Edge by Node
  v1: <v1,v2> <v1,z1>
  v2: <v2,v1> <v2,v3> <v2,z2>
  v3: <v3,v2> <v3,v4> <v3,z3>
  v4: <v4,v3> <v4,z4>
2. Bin edge to zone map
  v1: <[v1,v2],z1>
  v2: <[v1,v2],z2> <[v2,v3],z2>
  v3: <[v2,v3],z3> <[v3,v4],z3>
  v4: <[v3,v4],z4>
3. Collect over edges
  [v1,v2]: <[v1,v2],z1> <[v1,v2],z2>
  [v2,v3]: <[v2,v3],z2> <[v2,v3],z3>
  [v3,v4]: <[v3,v4],z3> <[v3,v4],z4>
4. A zone to zone map
  <z2,z1>
  <z3,z2>
  <z4,z3>
5. Reconcile zones
  z2: <z2,v2> <z2,z1>
  z3: <z3,v3> <z3,z2>
  z4: <z4,v4> <z4,z3>
6. Reassign zones to nodes
New Zone File
<v1,z1>
<v2,z1>
<v3,z2>
<v4,z3>
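One round of this game plan, and its repetition until no zone changes, can be sketched in memory as below. The function names `merge_round` and `label_components` are hypothetical, zones are represented by integers, and the rule "the larger zone ID is mapped to the smallest zone ID it touches" is an assumption that reproduces the new zone file shown on the slide.

```python
def merge_round(edges, zone_of):
    """One round: build the zone-to-zone map from the edges, then reassign nodes."""
    best = {}
    for u, v in edges:
        zu, zv = zone_of[u], zone_of[v]      # zones seen at the two ends of the edge
        if zu != zv:
            high, low = max(zu, zv), min(zu, zv)
            # Zone-to-zone map: remember the smallest zone each zone touches.
            best[high] = min(best.get(high, high), low)
    # Reconcile and reassign: every node follows its zone to the new zone.
    return {node: best.get(zone, zone) for node, zone in zone_of.items()}

def label_components(nodes, edges):
    """Repeat rounds until the zone file stops changing; return (zones, rounds)."""
    zone_of = {v: v for v in nodes}          # every node starts as its own zone
    rounds = 0
    while True:
        new_zone_of = merge_round(edges, zone_of)
        if new_zone_of == zone_of:
            return zone_of, rounds
        zone_of, rounds = new_zone_of, rounds + 1

# The slide's example: edges v1-v2, v2-v3, v3-v4.
edges = [(1, 2), (2, 3), (3, 4)]
print(merge_round(edges, {1: 1, 2: 2, 3: 3, 4: 4}))
print(label_components([1, 2, 3, 4], edges))
```

On this path-shaped example the zones shrink by only one step per round, which illustrates why the number of rounds depends on the diameter.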
Analysis
Communication: O(M+N)
Number of rounds: O(d) where d is the diameter of the graph
  Most real graphs have small diameters
  Random graph: d=O(log N)
  This works worst for a “path-graph”, where d=O(N)
An algorithm with O(M+N) communication and O(log N) rounds exists for all graphs [Rastogi ’12]. It uses an idea similar to MinHash
Intuit’s GraphEdge
A (hopefully soon to be open sourced) graph processing package for Hadoop built on Cascading
Efficient support of many core graph processing algorithms:
  State of the art algorithms
  Industry-grade test for scalability
Will take a few more months to release.
Would love to gauge your interest
Intuit’s Commercial Graph
Think of a graph in which a node is a business, or a consumer
An edge is a transaction between these entities
The entities are either direct clients of Intuit’s many offerings, or are business partners of Intuit’s clients
We experiment with a “toy” version of this graph: about 200M nodes and 10B edges.
References
Cohen, Jonathan. "Graph twiddling in a MapReduce world." Computing in Science & Engineering 11.4 (2009): 29-41.
Suri, Siddharth, and Sergei Vassilvitskii. "Counting triangles and the curse of the last reducer." Proceedings of the 20th international conference on World wide web. ACM, 2011.
Bahmani, Bahman, Kaushik Chakrabarti, and Dong Xin. "Fast personalized pagerank on mapreduce." Proceedings of the 37th SIGMOD international conference on Management of data. 2011.
Das Sarma, A., S. Gollapudi, and R. Panigrahy. "Estimating PageRank on graph streams." In PODS, pages 69–78, 2008.
Afrati, Foto N., Dimitris Fotakis, and Jeffrey D. Ullman. "Enumerating Subgraph Instances Using Map-Reduce." http://arxiv.org/abs/1208.0615 2012.
Lattanzi, Silvio, et al. "Filtering: a method for solving graph problems in mapreduce." 2011.