GraphX: Graph Analytics on Spark
![Page 1: GraphX : Graph Analytics on Spark](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816946550346895de0d2d4/html5/thumbnails/1.jpg)
GraphX: Graph Analytics on Spark
Joseph Gonzalez, Reynold Xin, Ion Stoica, Michael Franklin
Developed at the UC Berkeley AMPLab
AMPCamp: August 29, 2013
![Page 2: GraphX : Graph Analytics on Spark](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816946550346895de0d2d4/html5/thumbnails/2.jpg)
Graphs are Essential to Data Mining and Machine Learning
• Identify influential people and information
• Find communities
• Understand people’s shared interests
• Model complex data dependencies
![Page 3: GraphX : Graph Analytics on Spark](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816946550346895de0d2d4/html5/thumbnails/3.jpg)
Predicting Political Bias
[Figure: graph of posts labeled Liberal, Conservative, or unknown (?).]
Conditional Random Field, Belief Propagation
![Page 4: GraphX : Graph Analytics on Spark](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816946550346895de0d2d4/html5/thumbnails/4.jpg)
Triangle Counting
Count the triangles passing through each vertex:
Measures “cohesiveness” of local community
• More triangles: stronger community
• Fewer triangles: weaker community
[Figure: example graph with vertices 1 to 4.]
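A minimal sketch of per-vertex triangle counting on plain Scala collections, assuming an undirected graph given as a list of vertex-id pairs. The `TriangleCount` object, its `Edge` alias, and the sample edges in the usage note are hypothetical. This is not GraphX or GraphLab code.

```scala
object TriangleCount {
  // Undirected edge as a pair of vertex ids (hypothetical representation).
  type Edge = (Int, Int)

  /** Number of triangles passing through each vertex that has at least one edge. */
  def trianglesPerVertex(edges: Seq[Edge]): Map[Int, Int] = {
    // Build a symmetric adjacency map, ignoring self-loops.
    val directed = edges
      .filter { case (a, b) => a != b }
      .flatMap { case (a, b) => Seq(a -> b, b -> a) }
    val neighbors: Map[Int, Set[Int]] =
      directed.groupBy(_._1).map { case (v, es) => v -> es.map(_._2).toSet }

    neighbors.map { case (v, nbrs) =>
      // Each triangle through v corresponds to one unordered pair of
      // neighbors (a, b) that are themselves connected.
      val closedPairs = for {
        a <- nbrs.toSeq
        b <- nbrs.toSeq
        if a < b && neighbors.getOrElse(a, Set.empty[Int]).contains(b)
      } yield (a, b)
      v -> closedPairs.size
    }
  }
}
```

For the hypothetical edge list `Seq((1, 2), (2, 3), (1, 3), (3, 4))`, vertices 1, 2, and 3 each lie on one triangle and vertex 4 lies on none.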
![Page 5: GraphX : Graph Analytics on Spark](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816946550346895de0d2d4/html5/thumbnails/5.jpg)
Collaborative Filtering
[Figure: Ratings connecting Users and Items.]
![Page 6: GraphX : Graph Analytics on Spark](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816946550346895de0d2d4/html5/thumbnails/6.jpg)
Many More Graph Algorithms
• Collaborative Filtering
  – Alternating Least Squares
  – Stochastic Gradient Descent
  – Tensor Factorization
  – SVD
• Structured Prediction
  – Loopy Belief Propagation
  – Max-Product Linear Programs
  – Gibbs Sampling
• Semi-supervised ML
  – Graph SSL
  – CoEM
• Graph Analytics
  – PageRank
  – Single Source Shortest Path
  – Triangle-Counting
  – Graph Coloring
  – K-core Decomposition
  – Personalized PageRank
• Classification
  – Neural Networks
  – Lasso
…
![Page 7: GraphX : Graph Analytics on Spark](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816946550346895de0d2d4/html5/thumbnails/7.jpg)
Structure of Computation
[Figure: Data-Parallel (a Table of Rows producing a Result) versus Graph-Parallel (a Dependency Graph, e.g., Pregel).]
![Page 8: GraphX : Graph Analytics on Spark](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816946550346895de0d2d4/html5/thumbnails/8.jpg)
The Graph-Parallel Abstraction
• A user-defined Vertex-Program runs on each vertex
• Graph constrains interaction along edges
  – Using messages (e.g., Pregel [PODC’09, SIGMOD’10])
  – Through shared state (e.g., GraphLab [UAI’10, VLDB’12])
• Parallelism: run multiple vertex programs simultaneously
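To make the message-passing flavor of the abstraction concrete, here is a small sketch of single-source shortest paths written as a vertex program that runs in supersteps. It uses plain Scala collections and runs sequentially. The `VertexProgramSketch` object and its edge representation are hypothetical, it assumes non-negative edge weights and that every edge endpoint appears in `vertices`, and it is not the Pregel or GraphLab API.

```scala
object VertexProgramSketch {
  // Directed weighted edge: (source, destination, weight). Hypothetical representation.
  type Edge = (Int, Int, Double)

  /** Single-source shortest paths as a message-passing vertex program. */
  def shortestPaths(vertices: Set[Int], edges: Seq[Edge], source: Int): Map[Int, Double] = {
    // Vertex state: best known distance from the source.
    var dist: Map[Int, Double] =
      vertices.map(v => v -> (if (v == source) 0.0 else Double.PositiveInfinity)).toMap
    // Vertices whose state changed in the previous superstep.
    var active: Set[Int] = Set(source)

    while (active.nonEmpty) {
      // Send: each active vertex offers (its distance + edge weight) along its out-edges.
      val messages: Seq[(Int, Double)] = edges.collect {
        case (src, dst, w) if active.contains(src) => dst -> (dist(src) + w)
      }
      // Combine: keep only the smallest offer per destination vertex.
      val best: Map[Int, Double] =
        messages.groupBy(_._1).map { case (v, ms) => v -> ms.map(_._2).min }
      // Vertex program: adopt an offer only if it improves the current distance.
      val improved = best.filter { case (v, d) => d < dist(v) }
      dist = dist ++ improved
      active = improved.keySet
    }
    dist
  }
}
```

Interaction happens only along edges, and each vertex reads only its own state and its incoming messages, so in a real system the per-vertex work within a superstep could run in parallel.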
![Page 9: GraphX : Graph Analytics on Spark](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816946550346895de0d2d4/html5/thumbnails/9.jpg)
By exploiting graph structure, graph-parallel systems can be orders of magnitude faster.
![Page 10: GraphX : Graph Analytics on Spark](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816946550346895de0d2d4/html5/thumbnails/10.jpg)
Triangle Counting on Twitter
40M Users, 1.4 Billion Links
Counted: 34.8 Billion Triangles
• Hadoop [WWW’11]: 1536 machines, 423 minutes
• GraphLab: 64 machines, 15 seconds
1000 x Faster
S. Suri and S. Vassilvitskii, “Counting triangles and the curse of the last reducer,” WWW’11
![Page 11: GraphX : Graph Analytics on Spark](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816946550346895de0d2d4/html5/thumbnails/11.jpg)
Specialized Graph Systems
[Figure: specialized graph systems, including Pregel.]
![Page 12: GraphX : Graph Analytics on Spark](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816946550346895de0d2d4/html5/thumbnails/12.jpg)
Specialized Graph Systems
1. APIs to capture complex graph dependencies
2. Exploit graph structure to reduce communication and computation
![Page 13: GraphX : Graph Analytics on Spark](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816946550346895de0d2d4/html5/thumbnails/13.jpg)
Why GraphX?
![Page 14: GraphX : Graph Analytics on Spark](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816946550346895de0d2d4/html5/thumbnails/14.jpg)
The Bigger Picture
[Figure: Time Spent in Data Pipeline, on a 0% to 100% axis. Labels: GraphLab, Hadoop, Graph Creation, Graph Algorithms, Post Proc.]
![Page 15: GraphX : Graph Analytics on Spark](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816946550346895de0d2d4/html5/thumbnails/15.jpg)
![Page 16: GraphX : Graph Analytics on Spark](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816946550346895de0d2d4/html5/thumbnails/16.jpg)
Vertices
![Page 17: GraphX : Graph Analytics on Spark](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816946550346895de0d2d4/html5/thumbnails/17.jpg)
Edges
Edges
![Page 18: GraphX : Graph Analytics on Spark](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816946550346895de0d2d4/html5/thumbnails/18.jpg)
Limitations of Specialized Graph-Parallel Systems
• No support for Construction & Post Processing
• Not interactive
• Requires maintaining multiple platforms
Spark excels at these!
![Page 19: GraphX : Graph Analytics on Spark](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816946550346895de0d2d4/html5/thumbnails/19.jpg)
GraphX Unifies Data-Parallel and Graph-Parallel Systems
• Spark Table API: RDDs, fault-tolerance, and task scheduling
• GraphLab Graph API: graph representation and execution
[Figure: pipeline bar (0% to 100%) covering Graph Construction, Computation, and Post-Processing.]
One system for the entire graph pipeline
![Page 20: GraphX : Graph Analytics on Spark](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816946550346895de0d2d4/html5/thumbnails/20.jpg)
Enable Joining Tables and Graphs
[Figure: dataflow between Tables and Graphs. Labels: User Data, Product Ratings, Friend Graph, ETL, Product Rec. Graph, Join, Inf., Prod. Rec.]
![Page 21: GraphX : Graph Analytics on Spark](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816946550346895de0d2d4/html5/thumbnails/21.jpg)
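The GraphX Resilient Distributed Graph

Vertex table:

| Id | Attribute (V) |
|---|---|
| Rxin | (Stu., Berk.) |
| Jegonzal | (PstDoc, Berk.) |
| Franklin | (Prof., Berk) |
| Istoica | (Prof., Berk) |

Edge table:

| SrcId | DstId | Attribute (E) |
|---|---|---|
| rxin | jegonzal | Friend |
| franklin | rxin | Advisor |
| istoica | franklin | Coworker |
| franklin | jegonzal | PI |

[Figure: the graph drawn with vertices R, J, F, I.]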
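The vertex and edge tables above can be modeled with ordinary Scala collections to show how a triplets view might be derived as a join. This sketch stands in for RDDs with `Seq`. The `PropertyGraphTables` object and the `triplets` helper are hypothetical, vertex ids are lower-cased to match the edge table, and the pairing of edge attributes with edge rows follows the order in which they appear on the slide, which is an assumption.

```scala
object PropertyGraphTables {
  type Id = String

  // Vertex table (Id, V): rows as shown on the slide, ids lower-cased.
  val vertices: Seq[(Id, (String, String))] = Seq(
    ("rxin",     ("Stu.", "Berk.")),
    ("jegonzal", ("PstDoc", "Berk.")),
    ("franklin", ("Prof.", "Berk.")),
    ("istoica",  ("Prof.", "Berk."))
  )

  // Edge table (SrcId, DstId, E). Attribute-to-row pairing is assumed from slide order.
  val edges: Seq[(Id, Id, String)] = Seq(
    ("rxin", "jegonzal", "Friend"),
    ("franklin", "rxin", "Advisor"),
    ("istoica", "franklin", "Coworker"),
    ("franklin", "jegonzal", "PI")
  )

  /** Triplets view: each edge joined with the attributes of both endpoints. */
  def triplets[V, E](vs: Seq[(Id, V)], es: Seq[(Id, Id, E)]): Seq[((Id, V), (Id, V), E)] = {
    val attr: Map[Id, V] = vs.toMap
    es.map { case (src, dst, e) => ((src, attr(src)), (dst, attr(dst)), e) }
  }
}
```

Calling `PropertyGraphTables.triplets(vertices, edges)` yields one row per edge, each carrying the source attribute, the destination attribute, and the edge attribute.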
![Page 22: GraphX : Graph Analytics on Spark](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816946550346895de0d2d4/html5/thumbnails/22.jpg)
GraphX API

    class Graph[V, E] {
      // Table Views -----------------
      def vertices: RDD[(Id, V)]
      def edges: RDD[(Id, Id, E)]
      def triplets: RDD[((Id, V), (Id, V), E)]

      // Transformations ------------------------------
      def reverse: Graph[V, E]
      def filterV(p: (Id, V) => Boolean): Graph[V, E]
      def filterE(p: Edge[V, E] => Boolean): Graph[V, E]
      def mapV[T](m: (Id, V) => T): Graph[T, E]
      def mapE[T](m: Edge[V, E] => T): Graph[V, T]

      // Joins ----------------------------------------
      def joinV[T](tbl: RDD[(Id, T)]): Graph[(V, Opt[T]), E]
      def joinE[T](tbl: RDD[(Id, Id, T)]): Graph[V, (E, Opt[T])]

      // Computation ----------------------------------
      def aggregateNeighbors[T](mapF: (Edge[V, E]) => T,
                                reduceF: (T, T) => T,
                                direction: EdgeDir): Graph[T, E]
    }
![Page 23: GraphX : Graph Analytics on Spark](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816946550346895de0d2d4/html5/thumbnails/23.jpg)
Aggregate Neighbors
Map-Reduce for each vertex
[Figure: graph with vertices A to F. mapF is applied to the edges (A, B) and (A, C), giving a1 and a2, and reduceF(a1, a2) gives the result at A.]
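The per-vertex map-reduce above can be sketched on plain Scala collections as follows. This is an illustrative model only: the `AggregateNeighborsSketch` object, its `Edge` case class, and the curried parameter order (chosen so that Scala can infer the lambda types) are assumptions, and unlike the `aggregateNeighbors` signature on the API slide it returns a `Map` that contains only vertices with at least one matching edge.

```scala
object AggregateNeighborsSketch {
  type Id = String

  // Edge with both endpoint attributes attached (a triplet-style view).
  final case class Edge[V, E](src: (Id, V), dst: (Id, V), attr: E)

  sealed trait EdgeDir
  case object InEdges extends EdgeDir  // aggregate at the destination of each edge
  case object OutEdges extends EdgeDir // aggregate at the source of each edge

  def aggregateNeighbors[V, E, T](edges: Seq[Edge[V, E]], direction: EdgeDir)(
      mapF: Edge[V, E] => T)(reduceF: (T, T) => T): Map[Id, T] = {
    // Map phase: one value per edge, keyed by the vertex that receives it.
    val keyed: Seq[(Id, T)] = edges.map { e =>
      val target = direction match {
        case InEdges  => e.dst._1
        case OutEdges => e.src._1
      }
      target -> mapF(e)
    }
    // Reduce phase: combine all values that arrived at the same vertex.
    keyed.groupBy(_._1).map { case (id, vs) => id -> vs.map(_._2).reduce(reduceF) }
  }
}
```

For example, `aggregateNeighbors(edges, InEdges)(_ => 1)(_ + _)` counts incoming edges per vertex, which is one way to compute an in-degree.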
![Page 24: GraphX : Graph Analytics on Spark](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816946550346895de0d2d4/html5/thumbnails/24.jpg)
Example: Oldest Follower
What is the age of the oldest follower for each user?

    val followerAge = graph.aggNbrs(
      e => e.src.age, // MapF
      max(_, _),      // ReduceF
      InEdges).vertices

[Figure: graph with vertices A to F annotated with the ages 23, 42, 30, 19, 75, 16.]
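The same query can be worked through on plain collections to see what the map and reduce steps produce. The data below is hypothetical: the six ages are taken from the slide, but which age belongs to which vertex, and which users follow whom, is invented for illustration. The `OldestFollower` object is likewise an assumption and not part of GraphX.

```scala
object OldestFollower {
  // Hypothetical assignment of the slide's six ages to the six vertices.
  val age: Map[String, Int] =
    Map("A" -> 23, "B" -> 42, "C" -> 30, "D" -> 19, "E" -> 75, "F" -> 16)

  // Hypothetical (follower, user) pairs: the edge points from follower to user.
  val follows: Seq[(String, String)] =
    Seq(("B", "A"), ("C", "A"), ("D", "A"), ("E", "F"), ("A", "F"))

  /** Age of the oldest follower of each user that has at least one follower. */
  def oldestFollowerAge(follows: Seq[(String, String)], age: Map[String, Int]): Map[String, Int] =
    follows
      // MapF: emit the source (follower) age, keyed by the destination (InEdges).
      .map { case (follower, user) => user -> age(follower) }
      .groupBy(_._1)
      // ReduceF: keep the maximum age seen at each user.
      .map { case (user, ages) => user -> ages.map(_._2).max }
}
```

With this sample data, A is followed by B, C, and D, so its oldest follower is 42, and F is followed by E and A, so its oldest follower is 75. Users without followers do not appear in the result.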
![Page 25: GraphX : Graph Analytics on Spark](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816946550346895de0d2d4/html5/thumbnails/25.jpg)
We can express both Pregel and GraphLab using aggregateNeighbors in 40 lines of code!
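To give a feel for how an iterative, message-style computation can be layered on a per-edge map and per-vertex reduce, here is a sketch that labels connected components by repeating an aggregate step until nothing changes. It does not reproduce the authors' 40-line Pregel or GraphLab implementations. The `PregelOnAggregate` object, its simplified `aggregate` helper, and the choice of connected components as the example are all assumptions.

```scala
object PregelOnAggregate {
  type Id = Int

  /** One aggregateNeighbors-style step: map over edges, reduce per destination vertex. */
  def aggregate[T](edges: Seq[(Id, Id)])(mapF: ((Id, Id)) => T)(reduceF: (T, T) => T): Map[Id, T] =
    edges
      .map(e => e._2 -> mapF(e))
      .groupBy(_._1)
      .map { case (v, ms) => v -> ms.map(_._2).reduce(reduceF) }

  /** Label every vertex with the smallest vertex id in its connected component. */
  def connectedComponents(vertices: Set[Id], edges: Seq[(Id, Id)]): Map[Id, Id] = {
    // Treat edges as undirected by adding the reverse of each one.
    val both = edges ++ edges.map(_.swap)
    var label: Map[Id, Id] = vertices.map(v => v -> v).toMap
    var changed = true
    while (changed) {
      // "Messages": each vertex receives the minimum label among its neighbors.
      val incoming = aggregate(both)(e => label(e._1))((a, b) => math.min(a, b))
      // "Vertex program": keep the smaller of the current and received labels.
      val next = label.map { case (v, l) => v -> math.min(l, incoming.getOrElse(v, l)) }
      changed = next != label
      label = next
    }
    label
  }
}
```

Each loop iteration plays the role of one superstep: the aggregate step delivers combined messages, and the update step applies them to vertex state.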
![Page 26: GraphX : Graph Analytics on Spark](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816946550346895de0d2d4/html5/thumbnails/26.jpg)
Performance Optimizations
Replicate & co-partition vertices with edges
»GraphLab (PowerGraph) style vertex-cut partitioning
»Minimize communication by avoiding edge data movement in JOINs
In-memory hash index for fast joins
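A small sketch of the idea behind vertex-cut partitioning: every edge is placed in exactly one partition, and a vertex is replicated into each partition that holds one of its edges. The `VertexCutSketch` object and the hash function in `partitionOf` are assumptions chosen for illustration and are not the partitioner used by GraphX or PowerGraph.

```scala
object VertexCutSketch {
  type Edge = (Int, Int)

  /** Assign an edge to a partition using a simple hash of both endpoints (an assumption). */
  def partitionOf(e: Edge, numParts: Int): Int =
    Math.floorMod(e._1 * 31 + e._2, numParts)

  /** Group edges by partition; edges are never split or duplicated. */
  def partitionEdges(edges: Seq[Edge], numParts: Int): Map[Int, Seq[Edge]] =
    edges.groupBy(partitionOf(_, numParts))

  /** For each vertex, the set of partitions that need a replica of it. */
  def replicas(edges: Seq[Edge], numParts: Int): Map[Int, Set[Int]] =
    edges
      .flatMap { e =>
        val p = partitionOf(e, numParts)
        Seq(e._1 -> p, e._2 -> p)
      }
      .groupBy(_._1)
      .map { case (v, ps) => v -> ps.map(_._2).toSet }

  /** Average number of replicas per vertex; lower values mean less vertex data to ship. */
  def replicationFactor(edges: Seq[Edge], numParts: Int): Double = {
    val r = replicas(edges, numParts)
    r.values.map(_.size).sum.toDouble / r.size
  }
}
```

Because edge data stays where it was placed, a join against vertex attributes only needs to move vertex replicas to the partitions listed by `replicas`, which is the sense in which edge data movement is avoided.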
![Page 27: GraphX : Graph Analytics on Spark](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816946550346895de0d2d4/html5/thumbnails/27.jpg)
Early Performance
Runtime (in seconds, PageRank for 10 iterations):
• GraphLab: 22
• GraphX: 165
• Hadoop: 1340
![Page 28: GraphX : Graph Analytics on Spark](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816946550346895de0d2d4/html5/thumbnails/28.jpg)
In Progress Optimizations
Byte-code inspection of user functions
»E.g., if mapF does not need edge data, we can rewrite the query to delay the join
Execution strategies optimizer
»Scan edges, randomly accessing vertices
»Scan vertices, randomly accessing edges
![Page 29: GraphX : Graph Analytics on Spark](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816946550346895de0d2d4/html5/thumbnails/29.jpg)
Current Implementation
[Figure: layered stack with labels Spark (relational operators); GraphX; Pregel (20), GraphLab (20); PageRank (5), Connected Comp. (10), Shortest Path (10), ALS (40).]
![Page 30: GraphX : Graph Analytics on Spark](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816946550346895de0d2d4/html5/thumbnails/30.jpg)
Demo
Reynold Xin
![Page 31: GraphX : Graph Analytics on Spark](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816946550346895de0d2d4/html5/thumbnails/31.jpg)
Summary
1. Graph-parallel primitives on Spark.
2. Currently slower than GraphLab, but
»No need for specialized systems
»Easier ETL, and easier consumption of output
»Interactive graph data mining
3. Future work will bring performance closer to specialized engines.
![Page 32: GraphX : Graph Analytics on Spark](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816946550346895de0d2d4/html5/thumbnails/32.jpg)
Status
Currently finalizing the APIs
»Feedback wanted: http://bit.ly/graph-api
Also working on improving system performance
Will be part of Spark 0.9
![Page 34: GraphX : Graph Analytics on Spark](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816946550346895de0d2d4/html5/thumbnails/34.jpg)
Backup slides
![Page 35: GraphX : Graph Analytics on Spark](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816946550346895de0d2d4/html5/thumbnails/35.jpg)
Vertex Cut Partitioning
![Page 36: GraphX : Graph Analytics on Spark](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816946550346895de0d2d4/html5/thumbnails/36.jpg)
Vertex Cut Partitioning
![Page 37: GraphX : Graph Analytics on Spark](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816946550346895de0d2d4/html5/thumbnails/37.jpg)
aggregateNeighbors
![Page 38: GraphX : Graph Analytics on Spark](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816946550346895de0d2d4/html5/thumbnails/38.jpg)
aggregateNeighbors
![Page 39: GraphX : Graph Analytics on Spark](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816946550346895de0d2d4/html5/thumbnails/39.jpg)
aggregateNeighbors
![Page 40: GraphX : Graph Analytics on Spark](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816946550346895de0d2d4/html5/thumbnails/40.jpg)
aggregateNeighbors
![Page 41: GraphX : Graph Analytics on Spark](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816946550346895de0d2d4/html5/thumbnails/41.jpg)
Example: Vertex Degree
![Page 42: GraphX : Graph Analytics on Spark](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816946550346895de0d2d4/html5/thumbnails/42.jpg)
Example: Vertex Degree
![Page 43: GraphX : Graph Analytics on Spark](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816946550346895de0d2d4/html5/thumbnails/43.jpg)
Example: Vertex Degree
A: 5, B: 0, C: 0, D: 0, E: 0, F: 0
![Page 44: GraphX : Graph Analytics on Spark](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816946550346895de0d2d4/html5/thumbnails/44.jpg)
Example: Oldest Follower
What is the age of the oldest follower for each user?

    val followerAge = graph.aggNbrs(
      e => e.src.age, // MapF
      max(_, _),      // ReduceF
      InEdges).vertices

[Figure: graph with vertices A to F.]
![Page 45: GraphX : Graph Analytics on Spark](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816946550346895de0d2d4/html5/thumbnails/45.jpg)
Specialized Graph Systems
• Shared State [UAI’10, VLDB’12]
• Pregel: Messaging [PODC’09, SIGMOD’10]
• Many Others: Giraph, Stanford GPS, Signal-Collect, Combinatorial BLAS, BoostPGL, …
![Page 46: GraphX : Graph Analytics on Spark](https://reader035.fdocuments.us/reader035/viewer/2022081503/56816946550346895de0d2d4/html5/thumbnails/46.jpg)
GraphX API

    class Graph[V, E] {
      // Table Views -----------------
      def vertices: RDD[(Id, V)]
      def edges: RDD[(Id, Id, E)]
      def triplets: RDD[((Id, V), (Id, V), E)]

      // Transformations ------------------------------
      def reverse: Graph[V, E]
      def filterV(p: (Id, V) => Boolean): Graph[V, E]
      def filterE(p: Edge[V, E] => Boolean): Graph[V, E]
      def mapV[T](m: (Id, V) => T): Graph[T, E]
      def mapE[T](m: Edge[V, E] => T): Graph[V, T]

      // Joins ----------------------------------------
      def joinV[T](tbl: RDD[(Id, T)]): Graph[(V, Opt[T]), E]
      def joinE[T](tbl: RDD[(Id, Id, T)]): Graph[V, (E, Opt[T])]

      // Computation ----------------------------------
      def aggregateNeighbors[T](mapF: (Edge[V, E]) => T,
                                reduceF: (T, T) => T,
                                direction: EdgeDir): Graph[T, E]
    }