GraphFrames: Graph Queries in Spark SQL by Ankur Dave

20
GraphFrames: Graph Queries in Spark SQL Ankur Dave UC Berkeley AMPLab Joint work with Alekh Jindal (Microsoft), Li Erran Li (Uber), Joseph Gonzalez (UC Berkeley), Matei Zaharia (MIT and Databricks)

Transcript of GraphFrames: Graph Queries in Spark SQL by Ankur Dave

Page 1: GraphFrames: Graph Queries in Spark SQL by Ankur Dave

GraphFrames: Graph Queries in Spark SQLAnkur DaveUC Berkeley AMPLab

Joint work with Alekh Jindal (Microsoft), Li Erran Li (Uber),Joseph Gonzalez (UC Berkeley), Matei Zaharia (MIT and Databricks)

Page 2: GraphFrames: Graph Queries in Spark SQL by Ankur Dave

+ Graph Queries

2016Spark + GraphFrames

Support for Graph Analysis in Spark

+ Graph Algorithms

2013Spark + GraphX

Relational Queries

2009Spark

Graph Updates + Anchored Traversals

Neo4j, etc.

Page 3: GraphFrames: Graph Queries in Spark SQL by Ankur Dave

Graph Algorithms vs. Graph Queries

≈x

PageRank

Alternating Least Squares

Graph Algorithms Graph Queries

Page 4: GraphFrames: Graph Queries in Spark SQL by Ankur Dave

Graph Algorithms vs. Graph QueriesGraph Algorithm: PageRank Graph Query: Wikipedia Collaborators

Editor 1 Editor 2 Article 1 Article 2

Article 1

Article 2

Editor 1

Editor 2

same day} same day}

Page 5: GraphFrames: Graph Queries in Spark SQL by Ankur Dave

Graph Algorithms vs. Graph QueriesGraph Algorithm: PageRank

// Iterate until convergence wikipedia.pregel(sendMsg = { e =>

e.sendToDst(e.srcRank * e.weight)},mergeMsg = _ + _,vprog = { (id, oldRank, msgSum) =>

0.15 + 0.85 * msgSum})

Graph Query: Wikipedia Collaboratorswikipedia.find("(u1)-[e11]->(article1);(u2)-[e21]->(article1);(u1)-[e12]->(article2);(u2)-[e22]->(article2)")

.select("*","e11.date – e21.date".as("d1"),"e12.date – e22.date".as("d2"))

.sort("d1 + d2".desc).take(10)

Page 6: GraphFrames: Graph Queries in Spark SQL by Ankur Dave

Separate SystemsGraph Algorithms Graph Queries

Page 7: GraphFrames: Graph Queries in Spark SQL by Ankur Dave

Raw Wikipedia

< / >< / >< / >XML

Text Table

Edit GraphEdit Table

Frequent Collaborators

Problem: Mixed Graph AnalysisHyperlinks PageRank

User Product

User Article

Vandalism Suspects

User User

User Article

Page 8: GraphFrames: Graph Queries in Spark SQL by Ankur Dave

Solution: GraphFrames

Graph Algorithms Graph Queries

Spark SQL

GraphFrames API

Pattern Query Optimizer

Page 9: GraphFrames: Graph Queries in Spark SQL by Ankur Dave

GraphFrames API• Unifies graph algorithms, graph queries, and DataFrames• Available in Scala and Python

class GraphFrame {def vertices: DataFramedef edges: DataFrame

def find(pattern: String): DataFramedef registerView(pattern: String, df: DataFrame): Unit

def degrees(): DataFramedef pageRank(): GraphFramedef connectedComponents(): GraphFrame...

}

Page 10: GraphFrames: Graph Queries in Spark SQL by Ankur Dave

Implementation

Parsed Pattern

Logical Plan

Materialized Views

Optimized Logical Plan

DataFrameResult

Query String

Graph–RelationalTranslation Join Elimination

and Reordering

Spark SQL

View SelectionGraph

AlgorithmsGraphX

Page 11: GraphFrames: Graph Queries in Spark SQL by Ankur Dave

Graph–Relational Translation

B

D

A

C

Existing Logical PlanOutput: A,B,C

Src Dst

⋈C=Src

Edge Table

ID Attr

Vertex Table

⋈D=ID

Page 12: GraphFrames: Graph Queries in Spark SQL by Ankur Dave

Materialized View Selection

GraphX: Triplet view enables efficient message-passing algorithms

Vertices

B

A

C

D

Edges

A B

A C

B C

C D

A

B

Triplet View

A C

B C

C D

Graph

+

UpdatedPageRanks

B

A

C

D

A

Page 13: GraphFrames: Graph Queries in Spark SQL by Ankur Dave

Materialized View Selection

GraphFrames: User-defined views enable efficient graph queries

Vertices

B

A

C

D

Edges

A B

A C

B C

C D

A

B

Triplet View

A CB CC D

Graph

User-Defined Views

PageRank

CommunityDetection

Graph Queries

Page 14: GraphFrames: Graph Queries in Spark SQL by Ankur Dave

Join Elimination

Src Dst1 21 32 32 5

Edges

ID Attr1 A2 B3 C4 D

VerticesSELECT src, dst, attr AS src_attrFROM edges INNER JOIN vertices ON src = id;

Standard vertex-edge join:

SELECT src, dstFROM edges INNER JOIN vertices ON src = id;

Unnecessary join

can be eliminated if tables satisfy referential integrity, simplifying graph–relational translation

Page 15: GraphFrames: Graph Queries in Spark SQL by Ankur Dave

Join Reordering

A → B B → A

⋈A, BB → D

C→ B⋈BB → E⋈B

C→ D⋈BC→ E⋈C, D

⋈C, EExample Query

Left-Deep Plan Bushy Plan

A → B B → A

⋈A, B

B → D C→ B

⋈B

B → E⋈B⋈B⋈B, C

User-Defined View

Page 16: GraphFrames: Graph Queries in Spark SQL by Ankur Dave

EvaluationFaster than Neo4j for unanchored pattern queries

0

0.5

1

1.5

2

2.5

GraphFrames Neo4j

Que

ry la

tenc

y, s

Anchored Pattern Query

01020304050607080

GraphFrames Neo4j

Que

ry la

tenc

y, s

Unanchored Pattern Query

Triangle query on 1M edge subgraph of web-Google. Each system configured to use a single core.

Page 17: GraphFrames: Graph Queries in Spark SQL by Ankur Dave

EvaluationApproaches performance of GraphX for graph algorithms using Spark SQL whole-stage code generation

0

1

2

3

4

5

6

7

GraphFrames GraphX Naïve Spark

Per-i

tera

tion

runt

ime,

s

PageRank Performance

Per-iteration performance on web-Google, single 8-core machine. Naïve Spark uses Scala RDD API.

Page 18: GraphFrames: Graph Queries in Spark SQL by Ankur Dave

EvaluationRegistering the right views can greatly improve performance

Page 19: GraphFrames: Graph Queries in Spark SQL by Ankur Dave

Future Work• Suggest views automatically• Exploit attribute-based partitioning in optimizer• Code generation for single node

Page 20: GraphFrames: Graph Queries in Spark SQL by Ankur Dave

Try It Out!Preview available for Spark 1.4+ at:

https://github.com/graphframes/graphframesThanks to Databricks contributors Joseph Bradley, Xiangrui Meng, and Timothy Hunter.

Watch for the release on Spark Packages in the coming weeks.

[email protected]