GraphX: Graph Processing in a Distributed Dataflow Framework

OSDI 2014

Bidyut Hota

Agenda

•  Analytics space background

•  Motivation

•  Goal

•  Approach

•  Optimizations

•  Results

•  Flaws/Limitations

•  Questions

Real life Analytics Pipeline

Figure: Raw data -> Link table -> PageRank -> Desired results

E.g., Google Knowledge Graph: 570M vertices, 18B edges (as of mid-2017)

Tables

Graphs

Systems landscape

Motivation

• Currently, separate systems exist to compute on each of these data representations.

• Ability to combine data sources.

• Enhance dataflow frameworks to leverage inherent positives.

Current drawbacks of dataflow frameworks

• Implementing iterative algorithms requires multiple stages of complex joins.

• Do not cover common patterns in graph algorithms, which leaves room for optimization.

• Unlike Spark, no fine-grained control of data partitioning.

Current drawbacks of specialized systems

• Lack the ability to combine graphs with unstructured or tabular data.

• Favor snapshot recovery rather than lineage-based fault tolerance as in Spark.

What can we leverage?

• Immutability of RDDs.

• Reusing indices across graph and collection views over iterations.

• Increase in performance.

Goal

• General-purpose distributed framework for graph computations.

• Performance comparable to specialized graph processing systems.

Approach

•  Unifies Tabular view and Graph view

•  Adopt the best ideas of specialized systems

•  Graph representation on dataflow frameworks

•  Optimizations

•  Develop GraphX API on top of Spark

Graph approach: PageRank example

• E.g., PageRank algorithm

• Graph-parallel abstraction

• Define a vertex program

• Terminate when vertex programs vote to halt

Figure: PageRank in Pregel
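The vertex-program idea above can be sketched in plain Python. This is a minimal single-machine illustration, not the Pregel or GraphX API: the function name `pregel_pagerank`, the reset probability of 0.15, and the tolerance-based "vote to halt" rule are all assumptions made for the example.

```python
from collections import defaultdict


def pregel_pagerank(edges, reset=0.15, tol=1e-6, max_supersteps=100):
    """Synchronous, Pregel-style PageRank over a list of (src, dst) edges."""
    vertices = {v for edge in edges for v in edge}
    out_degree = defaultdict(int)
    for src, _ in edges:
        out_degree[src] += 1

    rank = {v: 1.0 for v in vertices}
    for _ in range(max_supersteps):
        # Message passing: each vertex sends rank / out_degree along its out-edges.
        inbox = defaultdict(float)
        for src, dst in edges:
            inbox[dst] += rank[src] / out_degree[src]
        # Vertex program: combine the incoming messages into a new rank.
        new_rank = {v: reset + (1.0 - reset) * inbox[v] for v in vertices}
        # Simplified halting rule (assumption): a vertex votes to halt once its
        # rank changes by less than tol; stop when every vertex has voted.
        all_halted = all(abs(new_rank[v] - rank[v]) < tol for v in vertices)
        rank = new_rank
        if all_halted:
            break
    return rank


cycle_ranks = pregel_pagerank([("a", "b"), ("b", "c"), ("c", "a")])
```

On a three-vertex cycle every vertex keeps rank 1.0, since each receives exactly the rank it sends.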

Approach

• GAS (Gather Apply Scatter)

How to apply this in dataflow frameworks?

• Map, group-by, join dataflow operators
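As a rough sketch of how one gather/apply step might be phrased with only join, map, and group-by, the following plain-Python example treats the vertex table as a dict and the edge table as a list. The name `gas_step` and its callback parameters are hypothetical, and the scatter phase is not modeled separately here.

```python
def gas_step(vertices, edges, gather, combine, apply):
    """One gather/apply step built from join, map, and group-by.

    vertices: dict of vertex id -> property
    edges: list of (src, dst) pairs
    """
    # Join: attach the source vertex property to each edge.
    joined = [(src, dst, vertices[src]) for src, dst in edges]
    # Map: compute one message per edge (gather).
    messages = [(dst, gather(src, prop)) for src, dst, prop in joined]
    # Group-by: combine the messages addressed to each destination vertex.
    grouped = {}
    for dst, msg in messages:
        grouped[dst] = combine(grouped[dst], msg) if dst in grouped else msg
    # Join: apply the combined message (or None) to each vertex property.
    return {v: apply(prop, grouped.get(v)) for v, prop in vertices.items()}


# Usage: propagate the minimum label along a directed path 1 -> 2 -> 3.
labels = {1: 1, 2: 2, 3: 3}
path = [(1, 2), (2, 3)]
for _ in range(2):
    labels = gas_step(
        labels,
        path,
        gather=lambda src, prop: prop,
        combine=min,
        apply=lambda old, msg: old if msg is None else min(old, msg),
    )
```

After two steps every vertex on the path carries label 1.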

Representing Property graphs as Tables

Never transfer edges

GraphX API

Using the dataflow operators

Logical representation: join of the vertices table on the edges table
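The logical join can be illustrated with two small tables. This is a sketch only: the `triplets` helper and the sample data are made up for the example, and a real system would evaluate the join in a distributed fashion.

```python
def triplets(vertices, edges):
    """Join the edge table with the vertex table on both endpoints."""
    return [
        (src, vertices[src], dst, vertices[dst], attr)
        for src, dst, attr in edges
    ]


# Vertex table: id -> property. Edge table: (src, dst, property).
people = {1: "alice", 2: "bob", 3: "carol"}
follows = [(1, 2, "follows"), (2, 3, "follows")]
view = triplets(people, follows)
```

Each row of `view` carries an edge together with the properties of its source and destination vertices, which is the input a vertex program needs.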

Using the dataflow operators on vertex program

User-defined

Optimizations

• Specialized data structure

• Vertex-cut partitioning

• Remote caching

• Active set tracking
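To make vertex-cut partitioning concrete, here is a minimal sketch in which every edge is assigned to exactly one partition and a routing table records which partitions reference each vertex. The hash `(src * 31 + dst) % num_partitions` and the helper names are assumptions for illustration, not the partitioning strategy GraphX uses.

```python
from collections import defaultdict


def vertex_cut(edges, num_partitions):
    """Assign each (src, dst) edge to one partition; vertices may be replicated."""
    partitions = defaultdict(list)
    routing = defaultdict(set)  # vertex id -> partitions that hold its edges
    for src, dst in edges:
        pid = (src * 31 + dst) % num_partitions  # hypothetical edge hash
        partitions[pid].append((src, dst))
        routing[src].add(pid)
        routing[dst].add(pid)
    return partitions, routing


def replication_factor(routing):
    """Average number of partitions in which a vertex appears."""
    return sum(len(pids) for pids in routing.values()) / len(routing)


parts, routing = vertex_cut([(1, 2), (1, 3), (1, 4), (2, 3)], num_partitions=2)
factor = replication_factor(routing)
```

Edges never move after placement; only vertex properties are shipped to the partitions listed in the routing table, so a lower replication factor means less communication.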

Implementing Optimizations

• Reusable hash index

• Sequential scan or clustered scan based on active set (dynamic)

• Incremental updates

• Automatic join elimination
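The dynamic choice between a sequential scan and a clustered index scan might look like the sketch below. It assumes edges are sorted (clustered) by source id; the helper names and the 0.5 threshold are placeholders chosen for the example, not values taken from the slides.

```python
def build_clustered_index(sorted_edges):
    """Map each source id to the [start, end) range of its edges."""
    index = {}
    for pos, (src, _) in enumerate(sorted_edges):
        start, _ = index.get(src, (pos, pos))
        index[src] = (start, pos + 1)
    return index


def scan_active_edges(sorted_edges, index, active, num_vertices, threshold=0.5):
    """Return the edges whose source is in the active set."""
    if len(active) < threshold * num_vertices:
        # Few active vertices: jump straight to their edge ranges via the index.
        result = []
        for v in sorted(active):
            start, end = index.get(v, (0, 0))
            result.extend(sorted_edges[start:end])
        return result
    # Many active vertices: one sequential pass with a filter is cheaper.
    return [edge for edge in sorted_edges if edge[0] in active]


edge_list = sorted([(3, 1), (1, 2), (1, 3), (2, 3)])
edge_index = build_clustered_index(edge_list)
few = scan_active_edges(edge_list, edge_index, {1}, num_vertices=3)
many = scan_active_edges(edge_list, edge_index, {1, 3}, num_vertices=3)
```

Both paths return the same edges; only the access pattern differs, and the index is built once and reused across iterations.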

Additional optimizations:

• Memory-based shuffle

• Batching and columnar structure

• Variable integer encoding
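Variable integer encoding stores small numbers in fewer bytes. The slides do not say which scheme is used, so the sketch below shows one common choice as an assumption: 7 payload bits per byte with the high bit marking continuation.

```python
def encode_varint(n):
    """Encode a non-negative integer, 7 bits per byte, low-order group first."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)  # final byte, high bit clear
            return bytes(out)


def decode_varint(data):
    """Decode a single varint from the start of a bytes object."""
    result = 0
    shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7
    raise ValueError("truncated varint")


small = encode_varint(1)    # one byte
larger = encode_varint(300)  # two bytes
```

Vertex ids and message payloads that are mostly small therefore take one or two bytes on the wire instead of a fixed eight.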

Results

Scaling for PageRank on Twitter dataset

Effect of partitioning on communication

Current Flaws

• Is not optimized for dynamic graphs.

• Requires incremental updates to routing table.

• Is not designed for streaming applications.

• Asynchronous graph computation not available. This is where Naiad will outperform.

Questions
