GraphFrames: An Integrated API for Mixing Graph and Relational Queries
Ankur Dave, UC Berkeley AMPLab
Joint work with Alekh Jindal (Microsoft), Li Erran Li (Uber), Reynold Xin (Databricks), Joseph Gonzalez (UC Berkeley), and Matei Zaharia (MIT and Databricks)

Trend: Unified Graph Analysis
• 2009: Spark (relational queries)
• 2013: Apache Spark + GraphX (+ graph algorithms)
• 2016: Apache Spark + GraphFrames (+ graph queries)

Graph Algorithms vs. Graph Queries
[Figure: graph algorithms, illustrated by PageRank and Alternating Least Squares, shown side by side with graph queries]

Graph Algorithms vs. Graph Queries
• Graph algorithm: PageRank
• Graph query: Wikipedia collaborators
[Figure: query pattern with Editor 1, Editor 2, Article 1, and Article 2, in which both editors edit each article on the same day]

Graph Algorithms vs. Graph Queries

Graph Algorithm: PageRank

    // Iterate until convergence
    wikipedia.pregel(
      sendMsg = { e => e.sendToDst(e.srcRank * e.weight) },
      mergeMsg = _ + _,
      vprog = { (id, oldRank, msgSum) => 0.15 + 0.85 * msgSum })

Graph Query: Wikipedia Collaborators

    import org.apache.spark.sql.functions.{col, expr}

    // Pairs of users who edited the same two articles
    wikipedia
      .find("(u1)-[e11]->(article1); (u2)-[e21]->(article1); " +
            "(u1)-[e12]->(article2); (u2)-[e22]->(article2)")
      .select(col("*"),
        expr("e11.date - e21.date").as("d1"),   // expr turns a SQL string into a Column
        expr("e12.date - e22.date").as("d2"))
      .sort(expr("d1 + d2").desc)
      .take(10)
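
As a rough illustration of what the Pregel-style PageRank snippet computes, the following plain-Scala sketch applies the same update rule (0.15 + 0.85 * msgSum) to a small in-memory graph. It does not use Spark or GraphFrames. The names `PageRankSketch`, `Edge`, `step`, and `run` are hypothetical, edge weights are assumed to be 1 / out-degree of the source, and a fixed iteration count stands in for a convergence test. As a simplification, vertices that receive no message are updated with a message sum of 0.

```scala
// Plain-Scala sketch of Pregel-style PageRank; not the GraphFrames implementation.
object PageRankSketch {
  // Hypothetical edge type: weight is assumed to be 1 / outDegree(src).
  final case class Edge(src: String, dst: String, weight: Double)

  // One superstep: send rank * weight along each edge, sum the messages per
  // destination, then apply the vertex program 0.15 + 0.85 * msgSum.
  def step(ranks: Map[String, Double], edges: Seq[Edge]): Map[String, Double] = {
    val msgSums: Map[String, Double] = edges
      .map(e => e.dst -> ranks(e.src) * e.weight)               // sendMsg
      .groupBy(_._1)
      .map { case (dst, msgs) => dst -> msgs.map(_._2).sum }    // mergeMsg = _ + _
    ranks.map { case (id, _) =>
      id -> (0.15 + 0.85 * msgSums.getOrElse(id, 0.0))          // vprog
    }
  }

  // A fixed number of supersteps stands in for "iterate until convergence".
  def run(vertices: Seq[String], edges: Seq[Edge], iterations: Int): Map[String, Double] = {
    val init = vertices.map(_ -> 1.0).toMap
    (1 to iterations).foldLeft(init)((ranks, _) => step(ranks, edges))
  }
}
```

On a two-vertex cycle with unit weights, every rank stays at 1.0, since 0.15 + 0.85 * 1.0 = 1.0 is a fixed point of the update.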

Separate Systems
• Graph Algorithms
• Graph Queries

Problem: Mixed Graph Analysis
[Figure: analysis pipeline over raw Wikipedia XML, involving a text table, an edit table, an edit graph, hyperlinks, PageRank, article text, user-article and user-user relations, frequent collaborators, and vandalism suspects]

Solution: GraphFrames
[Figure: architecture with graph algorithms and graph queries, the GraphFrames API, the pattern query optimizer, and Spark SQL]

GraphFrames API
• Unifies graph algorithms, graph queries, and relational operations (DataFrames)
• Designed for interactive use
• Available in Scala, Java, and Python

    class GraphFrame {
      def vertices: DataFrame
      def edges: DataFrame

      def find(pattern: String): DataFrame
      def registerView(pattern: String, df: DataFrame): Unit

      def degrees(): DataFrame
      def pageRank(): GraphFrame
      def connectedComponents(): GraphFrame
      ...
    }

Implementation
[Figure: query pipeline with the stages query string, parsed pattern, logical plan, optimized logical plan, and DataFrame result. The steps shown are graph–relational translation, join elimination and reordering, and view selection over materialized views, running on Spark SQL. Graph algorithms run on GraphX.]

Graph–Relational Translation
[Figure: a pattern over vertices A, B, C, and D. The existing logical plan (output: A, B, C) is joined with the edge table (Src, Dst) on C = Src and with the vertex table (ID, Attr) on D = ID.]
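
The translation step can be pictured as extending a set of partial matches by one join with the edge table per pattern edge. The plain-Scala sketch below is an assumption-laden illustration, not the GraphFrames planner: `MotifSketch`, `EdgePattern`, `joinStep`, and `find` are hypothetical names, the edge table is a simple sequence of (src, dst) pairs, and the join with the vertex table is omitted on the assumption that no vertex attributes are needed.

```scala
// Plain-Scala sketch of pattern matching as a chain of joins with the edge table.
object MotifSketch {
  type Binding = Map[String, String]                  // pattern variable -> vertex ID
  final case class EdgePattern(src: String, dst: String)

  // One relational join: extend each existing binding with every edge that
  // agrees with it on the variables that are already bound.
  def joinStep(bindings: Seq[Binding], p: EdgePattern,
               edges: Seq[(String, String)]): Seq[Binding] =
    for {
      b <- bindings
      (s, d) <- edges
      if b.get(p.src).forall(_ == s) && b.get(p.dst).forall(_ == d)
      if p.src != p.dst || s == d                     // same variable on both ends
    } yield b + (p.src -> s) + (p.dst -> d)

  // Fold the pattern edges into a left-to-right sequence of joins.
  def find(pattern: Seq[EdgePattern], edges: Seq[(String, String)]): Seq[Binding] =
    pattern.foldLeft(Seq[Binding](Map.empty))((bs, p) => joinStep(bs, p, edges))
}
```

For example, matching the two-edge pattern x → y, y → z against the edges (A, B), (A, C), (B, C), (C, D) yields the three paths A, B, C; A, C, D; and B, C, D.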

Join Elimination

Edges
Src  Dst
1    2
1    3
2    3
2    5

Vertices
ID  Attr
1   A
2   B
3   C
4   D

SELECT src, dst FROM edges INNER JOIN vertices ON src = id;

The join is unnecessary and can be eliminated if the tables satisfy referential integrity, simplifying graph–relational translation:

SELECT src, dst FROM edges;
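
The equivalence of the two queries can be checked on the slide's own tables with a small plain-Scala sketch. The names `JoinEliminationSketch`, `withJoin`, `withoutJoin`, and `srcIsForeignKey` are hypothetical, and referential integrity is assumed here to mean that vertex IDs are unique and every edge source appears among them.

```scala
// Plain-Scala sketch of join elimination; collections stand in for SQL tables.
object JoinEliminationSketch {
  final case class Edge(src: Int, dst: Int)
  final case class Vertex(id: Int, attr: String)

  // SELECT src, dst FROM edges INNER JOIN vertices ON src = id
  def withJoin(edges: Seq[Edge], vertices: Seq[Vertex]): Seq[(Int, Int)] =
    for { e <- edges; v <- vertices if e.src == v.id } yield (e.src, e.dst)

  // SELECT src, dst FROM edges
  def withoutJoin(edges: Seq[Edge]): Seq[(Int, Int)] =
    edges.map(e => (e.src, e.dst))

  // Assumed integrity condition: IDs are unique and every src is a vertex ID.
  def srcIsForeignKey(edges: Seq[Edge], vertices: Seq[Vertex]): Boolean = {
    val ids = vertices.map(_.id)
    ids.distinct.size == ids.size && edges.forall(e => ids.contains(e.src))
  }
}
```

With the edges (1, 2), (1, 3), (2, 3), (2, 5) and vertices 1 to 4, the condition holds for `src` and both functions return the same rows. An edge whose source is missing from the vertex table breaks the condition, and the join would then drop that row.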

Materialized View Selection
GraphX: Triplet view enabled efficient message-passing algorithms
[Figure: a graph with vertices A, B, C, D and edges (A, B), (A, C), (B, C), (C, D) is materialized as a triplet view, which is combined with the graph to produce updated PageRanks]
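
A triplet view can be thought of as the edge table joined with the attributes of both endpoints, so that a message-passing step only has to scan one precomputed table. The plain-Scala sketch below illustrates that idea under stated assumptions: `TripletViewSketch`, `Triplet`, `triplets`, and `sendToDst` are hypothetical names, not GraphX or GraphFrames APIs, and messages are merged by summation.

```scala
// Plain-Scala sketch of a triplet view and one message-passing step over it.
object TripletViewSketch {
  final case class Triplet[A](src: String, srcAttr: A, dst: String, dstAttr: A)

  // Materialize each edge together with the attributes of both endpoints.
  // Edges whose endpoints are missing from the vertex map are dropped.
  def triplets[A](vertices: Map[String, A],
                  edges: Seq[(String, String)]): Seq[Triplet[A]] =
    for {
      (s, d) <- edges
      sa <- vertices.get(s).toList
      da <- vertices.get(d).toList
    } yield Triplet(s, sa, d, da)

  // Each triplet sends msg(triplet) to its destination; messages are summed.
  def sendToDst(view: Seq[Triplet[Double]])(msg: Triplet[Double] => Double): Map[String, Double] =
    view.groupBy(_.dst).map { case (dst, ts) => dst -> ts.map(msg).sum }
}
```

With every vertex attribute set to 1.0 and the four edges from the figure, sending the source attribute along each edge delivers 1.0 to B, 2.0 to C, and 1.0 to D, while A receives no message.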

Materialized View Selection
GraphFrames: User-defined views enable efficient graph queries
[Figure: the same graph with vertices A, B, C, D and edges (A, B), (A, C), (B, C), (C, D). The triplet view serves graph algorithms such as PageRank and community detection, and user-defined views serve graph queries.]

Join Reordering
[Figure: an example query with pattern edges A → B, B → A, B → D, C → B, B → E, C → D, and C → E, planned two ways: a left-deep plan that joins one edge at a time, and a bushy plan that incorporates a user-defined view]

Query Planning Algorithm
Dynamic programming algorithm based on: J. Huang, K. Venkatraman, and D.J. Abadi. Query optimization of distributed pattern matching. In ICDE 2014.
1. Considers all left-deep plans, and a subset of bushy plans
   • Bushy plans to explore are chosen using layered-DAG and cycle-detection heuristics
2. Considers using each view that is exactly equivalent to a plan subtree
   • Result: Selects the largest of multiple hierarchically contained views
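
The second step can be illustrated with a deliberately simplified plain-Scala sketch. It treats a pattern as a set of (srcVar, dstVar) edges and a view as usable when its pattern is contained in the query, then picks the view covering the most edges. This is only an approximation under stated assumptions: the names `ViewSelectionSketch`, `View`, `usable`, `select`, and `remaining` are hypothetical, variable renaming is ignored, and no cost model or dynamic programming is involved, unlike the planner described above.

```scala
// Plain-Scala sketch of choosing the largest applicable view for a pattern query.
object ViewSelectionSketch {
  type PatternEdge = (String, String)                 // (srcVar, dstVar)
  final case class View(name: String, pattern: Set[PatternEdge])

  // A view is treated as usable if its pattern is fully contained in the query.
  def usable(query: Set[PatternEdge], views: Seq[View]): Seq[View] =
    views.filter(v => v.pattern.nonEmpty && v.pattern.subsetOf(query))

  // Among usable views, prefer the one that covers the most pattern edges.
  def select(query: Set[PatternEdge], views: Seq[View]): Option[View] = {
    val candidates = usable(query, views)
    if (candidates.isEmpty) None else Some(candidates.maxBy(_.pattern.size))
  }

  // Pattern edges that still have to be joined after substituting the view.
  def remaining(query: Set[PatternEdge], view: Option[View]): Set[PatternEdge] =
    view.fold(query)(v => query -- v.pattern)
}
```

Given two hypothetical views where one pattern contains the other, for example {C → D} and {C → D, C → E}, the sketch selects the larger one and leaves the other query edges to be joined separately.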

Evaluation
Faster than Neo4j for unanchored pattern queries
[Figure: two bar charts of query latency in seconds for GraphFrames and Neo4j. Anchored pattern query: axis from 0 to 2.5 s. Unanchored pattern query: axis from 0 to 80 s.]
Triangle query on 1M edge subgraph of web-Google. Each system configured to use a single core.

Evaluation
Approaches performance of GraphX for graph algorithms using Spark SQL whole-stage code generation
[Figure: bar chart "PageRank Performance" of per-iteration runtime in seconds (axis from 0 to 7) for GraphFrames, GraphX, and Naïve Spark]
Per-iteration performance on web-Google, single 8-core machine. Naïve Spark uses Scala RDD API.

Evaluation
Registering the right views can greatly improve performance for some queries
Workload: J. Huang, K. Venkatraman, and D.J. Abadi. Query optimization of distributed pattern matching. In ICDE 2014.

Future Work
• Suggest views automatically
• Exploit attribute-based partitioning in optimizer
• Code generation for single node

Try It Out!
Released as a Spark Package at: https://github.com/graphframes/graphframes
Thanks to Joseph Bradley, Xiangrui Meng, and Timothy Hunter.