GraphFrames: Graph Queries in Spark SQL
Ankur Dave
UC Berkeley AMPLab
Joint work with Alekh Jindal (Microsoft), Li Erran Li (Uber), Joseph Gonzalez (UC Berkeley), Matei Zaharia (MIT and Databricks)
Support for Graph Analysis in Spark
2009: Spark (Relational Queries)
2013: Spark + GraphX (+ Graph Algorithms)
2016: Spark + GraphFrames (+ Graph Queries)
Neo4j, etc.: Graph Updates + Anchored Traversals
Graph Algorithms vs. Graph Queries
[Figure: two columns, Graph Algorithms (examples: PageRank, Alternating Least Squares) and Graph Queries.]
Graph Algorithms vs. Graph Queries
Graph Algorithm: PageRank
Graph Query: Wikipedia Collaborators
[Figure: Editor 1 and Editor 2 both edit Article 1 and Article 2; the two edits to each article fall on the same day.]
Graph Algorithms vs. Graph Queries
Graph Algorithm: PageRank
    // Iterate until convergence
    wikipedia.pregel(
      sendMsg = { e => e.sendToDst(e.srcRank * e.weight) },
      mergeMsg = _ + _,
      vprog = { (id, oldRank, msgSum) => 0.15 + 0.85 * msgSum })
Graph Query: Wikipedia Collaborators
    wikipedia.find(
        "(u1)-[e11]->(article1); (u2)-[e21]->(article1); " +
        "(u1)-[e12]->(article2); (u2)-[e22]->(article2)")
      .select("*",
        "e11.date - e21.date".as("d1"),
        "e12.date - e22.date".as("d2"))
      .sort("d1 + d2".desc).take(10)
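The message-passing PageRank on this slide can be sketched in plain Python to make the three callbacks concrete. This is only an illustration, not the GraphFrames or GraphX implementation: the function name `pregel_pagerank` and the sample edges are hypothetical, the edge weight is assumed to be 1 / out-degree of the source (the slide does not define it), and a fixed iteration count stands in for the convergence test.

```python
from collections import defaultdict


def pregel_pagerank(edges, num_iters=20):
    """Run num_iters message-passing supersteps of PageRank.

    edges: list of (src, dst) pairs.
    Assumption: weight of an edge is 1 / out_degree(src).
    """
    out_degree = defaultdict(int)
    vertices = set()
    for src, dst in edges:
        out_degree[src] += 1
        vertices.update((src, dst))

    ranks = {v: 1.0 for v in vertices}
    for _ in range(num_iters):
        # sendMsg and mergeMsg: every edge sends srcRank * weight to its
        # destination, and messages arriving at one vertex are summed.
        msg_sum = defaultdict(float)
        for src, dst in edges:
            msg_sum[dst] += ranks[src] / out_degree[src]
        # vprog: compute the new rank from the merged message sum.
        ranks = {v: 0.15 + 0.85 * msg_sum[v] for v in vertices}
    return ranks


# Hypothetical four-edge graph in which every vertex has an out-edge.
ranks = pregel_pagerank([("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")])
```

Because every vertex in the sample graph has at least one outgoing edge, the total rank stays equal to the number of vertices after each superstep.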
Separate Systems
[Figure: a pipeline split between a graph algorithms system and a graph queries system, with stages labeled Raw Wikipedia (XML), Text Table, Edit Graph, Edit Table, and Frequent Collaborators.]
Problem: Mixed Graph Analysis
[Figure: a mixed analysis pipeline with elements labeled Hyperlinks, PageRank, User-Product, User-Article, User-User, and Vandalism Suspects.]
Solution: GraphFrames
[Figure: layered stack with labels Graph Algorithms, Graph Queries, GraphFrames API, Pattern Query Optimizer, and Spark SQL.]
GraphFrames API
• Unifies graph algorithms, graph queries, and DataFrames
• Available in Scala and Python
    class GraphFrame {
      def vertices: DataFrame
      def edges: DataFrame

      def find(pattern: String): DataFrame
      def registerView(pattern: String, df: DataFrame): Unit

      def degrees(): DataFrame
      def pageRank(): GraphFrame
      def connectedComponents(): GraphFrame
      ...
    }
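Implementation
[Figure: query pipeline with stages labeled Query String, Parsed Pattern, Logical Plan, Optimized Logical Plan, and DataFrame Result. Transformations are labeled Graph–Relational Translation, Join Elimination and Reordering, and View Selection (using Materialized Views); execution is labeled Spark SQL. Graph Algorithms are labeled GraphX.]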
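Graph–Relational Translation
[Figure: a pattern over vertices A, B, C, D. The existing logical plan (output: A, B, C) is joined with the edge table (Src, Dst) on C = Src, and the result is joined with the vertex table (ID, Attr) on D = ID.]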
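The translation step on this slide, extending an existing plan by one pattern edge, can be sketched with plain Python dictionaries standing in for tables. This is a rough illustration under stated assumptions, not the GraphFrames planner: `extend_with_edge`, the `_attr` naming, and the sample graph are all hypothetical.

```python
def extend_with_edge(bindings, edges, vertices, src_var, dst_var):
    """Extend each partial match by one pattern edge (src_var)->(dst_var).

    bindings: list of dicts mapping pattern variables to vertex IDs.
    edges:    list of (src, dst) pairs (the edge table).
    vertices: dict of id -> attr (the vertex table).
    """
    # Hash index on the edge table's Src column for the first join.
    by_src = {}
    for src, dst in edges:
        by_src.setdefault(src, []).append(dst)

    extended = []
    for row in bindings:
        # Join with the edge table on src_var = Src.
        for dst in by_src.get(row[src_var], []):
            # Join with the vertex table on dst_var = ID.
            if dst in vertices:
                extended.append(
                    {**row, dst_var: dst, dst_var + "_attr": vertices[dst]})
    return extended


# Hypothetical sample graph.
vertices = {1: "a", 2: "b", 3: "c", 4: "d"}
edges = [(1, 2), (2, 3), (3, 4), (3, 1)]

plan = [{"A": v} for v in vertices]
plan = extend_with_edge(plan, edges, vertices, "A", "B")
plan = extend_with_edge(plan, edges, vertices, "B", "C")  # output: A, B, C
plan = extend_with_edge(plan, edges, vertices, "C", "D")  # adds C -> D
matches = [(r["A"], r["B"], r["C"], r["D"]) for r in plan]
```

Each call corresponds to the two joins in the figure. Like the sketch's joins, the result allows a vertex to be bound to more than one pattern variable.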
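Materialized View Selection
GraphX: Triplet view enables efficient message-passing algorithms
[Figure: Vertices (A, B, C, D) and Edges ((A, B), (A, C), (B, C), (C, D)) make up the Graph. The Triplet View combines each edge with its endpoint vertices and is used to compute Updated PageRanks for the vertices.]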
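A triplet view can be sketched as a precomputed join of every edge with the attributes of both endpoints, so that a message-passing step reads one table instead of three. The sketch below reuses the vertices and edges from the figure; the rank values of 1.0 and the name `triplet_view` are assumptions for illustration, and this is not GraphX's actual data layout.

```python
# Vertex table: id -> attribute (here a rank; the 1.0 values are made up).
vertices = {"A": 1.0, "B": 1.0, "C": 1.0, "D": 1.0}
# Edge table from the figure, read as (src, dst) pairs.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]


def triplet_view(vertices, edges):
    """Materialize (src, src_attr, dst, dst_attr) for every edge."""
    return [(src, vertices[src], dst, vertices[dst]) for src, dst in edges]


triplets = triplet_view(vertices, edges)

# One message-passing step reads only the view: each triplet sends the
# source attribute to its destination, and messages are summed.
incoming = {}
for src, src_attr, dst, _dst_attr in triplets:
    incoming[dst] = incoming.get(dst, 0.0) + src_attr
```

The view has to be refreshed whenever vertex attributes change, which is the cost paid for skipping the joins during each step.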
Materialized View Selection
GraphFrames: User-defined views enable efficient graph queries
[Figure: the same Graph (Vertices A, B, C, D; Edges (A, B), (A, C), (B, C), (C, D)). The Triplet View serves PageRank, Community Detection, …; User-Defined Views serve Graph Queries.]
Join Elimination

Edges
Src  Dst
1    2
1    3
2    3
2    5

Vertices
ID   Attr
1    A
2    B
3    C
4    D

Standard vertex-edge join:
    SELECT src, dst, attr AS src_attr
    FROM edges INNER JOIN vertices ON src = id;

Unnecessary join:
    SELECT src, dst
    FROM edges INNER JOIN vertices ON src = id;

The unnecessary join can be eliminated if the tables satisfy referential integrity, simplifying graph–relational translation.
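The equivalence behind join elimination can be checked on the slide's own tables with Python's built-in `sqlite3` module. This is a small sanity check, not how Spark SQL performs the rewrite; the helper name `edge_rows` and the extra dangling edge (9, 1) are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vertices (id INTEGER PRIMARY KEY, attr TEXT);
    CREATE TABLE edges (src INTEGER, dst INTEGER);
    INSERT INTO vertices VALUES (1, 'A'), (2, 'B'), (3, 'C'), (4, 'D');
    INSERT INTO edges VALUES (1, 2), (1, 3), (2, 3), (2, 5);
""")


def edge_rows(conn, join_vertices):
    """Return (src, dst) rows, with or without the join on src = id."""
    sql = "SELECT src, dst FROM edges"
    if join_vertices:
        sql += " INNER JOIN vertices ON src = id"
    return conn.execute(sql + " ORDER BY src, dst").fetchall()


# Every src appears exactly once as a vertex id, so the join changes nothing.
with_join = edge_rows(conn, True)
without_join = edge_rows(conn, False)

# Hypothetical dangling edge: src 9 has no vertex row, so referential
# integrity no longer holds and the join silently drops the edge.
conn.execute("INSERT INTO edges VALUES (9, 1)")
dangling_with_join = edge_rows(conn, True)
dangling_without_join = edge_rows(conn, False)
```

The second half shows why the precondition matters: once an edge refers to a missing source vertex, the two queries return different rows and the join can no longer be dropped.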
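Join Reordering
[Figure: an example query over the pattern edges A → B, B → A, B → D, C → B, B → E, C → D, C → E, shown as two join trees. Left-Deep Plan: the edges are joined one at a time on their shared vertices. Bushy Plan: A → B, B → A, B → D, C → B, and B → E are joined together with a User-Defined View.]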
Evaluation
Faster than Neo4j for unanchored pattern queries
[Figure: two bar charts of query latency (s) for GraphFrames vs. Neo4j. Anchored Pattern Query: axis from 0 to 2.5 s. Unanchored Pattern Query: axis from 0 to 80 s.]
Triangle query on 1M-edge subgraph of web-Google. Each system configured to use a single core.
Evaluation
Approaches performance of GraphX for graph algorithms using Spark SQL whole-stage code generation
[Figure: bar chart, PageRank Performance. Per-iteration runtime (s), axis from 0 to 7, for GraphFrames, GraphX, and Naïve Spark.]
Per-iteration performance on web-Google, single 8-core machine. Naïve Spark uses Scala RDD API.
Evaluation
Registering the right views can greatly improve performance
Future Work
• Suggest views automatically
• Exploit attribute-based partitioning in optimizer
• Code generation for single node
Try It Out!
Preview available for Spark 1.4+ at:
https://github.com/graphframes/graphframes
Thanks to Databricks contributors Joseph Bradley, Xiangrui Meng, and Timothy Hunter.
Watch for the release on Spark Packages in the coming weeks.