Tree and Graph Processing On Hadoop


Page 1: Tree and Graph Processing  On Hadoop

1

Tree and Graph Processing On Hadoop

Ted Malaska

Page 2: Tree and Graph Processing  On Hadoop

2

Schedule

• Intro
• Overview of Hadoop and Eco-System
• Summarize Tree Rooting
• MR Overview/Implementation Options
• HBase Overview/Implementation Options
• Giraph Overview/Implementation Options
• Spark Overview/Implementation Options
• Summary
• Questions

Page 3: Tree and Graph Processing  On Hadoop

3

Intro

• Hi there

Page 4: Tree and Graph Processing  On Hadoop

4

Overview of Hadoop and Eco-System

[Diagram: Hadoop ecosystem stack. Categories: Search, NoSQL, Machine Learning, LFP, RTQ, Streaming, Ingestion, Batch. Platform layers: HDFS, Security and Access Controls, Auditing and Monitoring. Components: MapReduce, Pig, Crunch, Hive, Giraph, Sqoop, Flume, Kafka, Storm, Spark Streaming, Spark, Impala, Mahout, Oryx, R Python Streaming, SAS, HBase, Accumulo, NFS, Search Solr.]

Page 5: Tree and Graph Processing  On Hadoop

5

In Scope for Tonight

[Diagram: the same Hadoop ecosystem stack, repeated. Categories: Search, NoSQL, Machine Learning, LFP, RTQ, Streaming, Ingestion, Batch. Platform layers: HDFS, Security and Access Controls, Auditing and Monitoring. Components: MapReduce, Pig, Crunch, Hive, Giraph, Sqoop, Flume, Kafka, Storm, Spark Streaming, Spark, Impala, Mahout, Oryx, R Python Streaming, SAS, HBase, Accumulo, NFS, Search Solr.]

Page 6: Tree and Graph Processing  On Hadoop

6

Summarize Tree Rooting

• Basic Tree

[Diagram: a tree whose vertices are labeled with their depth, from 0 at the root down to 3. Callouts: True Root, Branches, Leaves, Vertex, Edge, Depth.]

Page 7: Tree and Graph Processing  On Hadoop

7

Summarize Tree Rooting

• More Complex Tree

[Diagram: a tree whose vertices are labeled with their depth, from 0 to 3. Callouts: Circular Link, Multiple Parents.]

Page 8: Tree and Graph Processing  On Hadoop

8

Summarize Tree Rooting

• Merging Trees
• Borderline True Graph Problem

[Diagram: merged trees whose vertices are labeled with their depth, with more than one depth-0 vertex. Callouts: True Root (twice), Multi Rooted Vertex.]

Page 9: Tree and Graph Processing  On Hadoop

9

Summarize Tree Rooting

• Know your data

Page 10: Tree and Graph Processing  On Hadoop

10

Basic Storage Format

• <NodeID>|<EdgeID>

• Example
  • 101
  • 101|201
  • 101|202
  • 201
  • 202|301
  • 301
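
A minimal sketch of reading this format into an in-memory adjacency map. It assumes that `<EdgeID>` is the ID of the vertex the edge points to (the example is consistent with that reading, but the slide does not state it). The function name `parse_records` is hypothetical.

```python
def parse_records(lines):
    """Build {node_id: [target_id, ...]} from '<NodeID>' and '<NodeID>|<EdgeID>' lines."""
    adjacency = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # partition() leaves edge_id empty for a bare '<NodeID>' line
        node_id, _, edge_id = line.partition("|")
        adjacency.setdefault(node_id, [])
        if edge_id:
            adjacency[node_id].append(edge_id)
            # Assumption: the edge target is itself a vertex, so register it too
            adjacency.setdefault(edge_id, [])
    return adjacency


records = ["101", "101|201", "101|202", "201", "202|301", "301"]
graph = parse_records(records)
print(graph)
```

With the example records, `101` ends up with two outgoing edges, `202` with one, and `201` and `301` with none.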

Page 11: Tree and Graph Processing  On Hadoop

11

Preprocessing

• Trimming Data
  • Nodes and edges have data
  • Data has weight
  • Normally linkage information is under 10% of true data size

• Organize Data by Partitioning

Page 12: Tree and Graph Processing  On Hadoop

12

Basic Solution

• Step 1: Identify Roots
  • Echo to all edges
  • Vertices that receive no echoes are roots
  • Root the root

• Step 2: Walk the tree
  • Echo from the last newly rooted vertex to all edges
  • If a vertex is not already rooted, then root it

Input:
• 101
• 101|201
• 101|202
• 201
• 202|301
• 301

After Step 1:
• 101|R:101
• 101|201|R:101
• 101|202|R:101
• 201|R:Null
• 202|301|R:Null
• 301|R:Null

After the first pass of Step 2:
• 101|R:101
• 101|201|R:101
• 101|202|R:101
• 201|R:101
• 202|301|R:101
• 301|R:Null

After the second pass of Step 2:
• 101|R:101
• 101|201|R:101
• 101|202|R:101
• 201|R:101
• 202|301|R:101
• 301|R:101
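
The two steps can be sketched in plain Python on a single machine, before bringing in any Hadoop tooling. This is an illustration under assumptions, not the speaker's implementation: the graph is an in-memory dict of vertex to edge targets, and when a vertex has multiple parents the first root to reach it wins. The names `find_roots` and `root_tree` are hypothetical.

```python
def find_roots(graph):
    """Step 1: every vertex echoes to its edges; vertices that hear nothing are roots."""
    echoed = {target for targets in graph.values() for target in targets}
    return {vertex: vertex for vertex in graph if vertex not in echoed}


def root_tree(graph):
    """Step 2: repeatedly echo from the newly rooted vertices until nothing new is rooted."""
    rooted = find_roots(graph)
    frontier = set(rooted)
    while frontier:
        newly_rooted = {}
        for vertex in sorted(frontier):
            for target in graph[vertex]:
                # Already-rooted vertices are skipped, which also stops circular links
                if target not in rooted and target not in newly_rooted:
                    newly_rooted[target] = rooted[vertex]
        rooted.update(newly_rooted)
        frontier = set(newly_rooted)
    # Vertices never reached from a root (for example, a pure cycle) stay None
    return {vertex: rooted.get(vertex) for vertex in graph}


graph = {"101": ["201", "202"], "201": [], "202": ["301"], "301": []}
print(root_tree(graph))
```

For the example data every vertex ends up rooted at `101`, matching the final state listed above.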

Page 13: Tree and Graph Processing  On Hadoop

13

Map Reduce

• Massively parallel processing on Hadoop
• Based on the Google 2004 MapReduce white paper
• Able to process PBs of data

Page 14: Tree and Graph Processing  On Hadoop

14

Map Reduce

[Diagram: three sets of Data Blocks feed three Mappers, whose output passes through Sort & Shuffle to two tasks (labeled "Mapper" on the slide, in the reducer position) that write two sets of Data Blocks.]

Page 15: Tree and Graph Processing  On Hadoop

15

Map Reduce

• Self joins
• Always dumping two outputs:
  • Newly Rooted
  • Still Un-Rooted

[Diagram: All Data → MR Stage 0 (Root Identifying) → Un-Rooted, Newly Rooted → MR Stage 1 (Rooting) → Un-Rooted, Newly Rooted, Old Rooted 0 → MR Stage 2 (Rooting) → Un-Rooted, Newly Rooted, Old Rooted 0, Old Rooted 1.]
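
The staged self-join can be simulated in memory to see how the two outputs evolve. This is only a sketch under assumptions: `run_job` is a hypothetical stand-in for a real MapReduce job (map, group by key, reduce), records are `(node, edge, root)` tuples mirroring the `101|201|R:101` layout, and ties between multiple parents are broken by taking the smallest root ID. None of this uses the Hadoop API.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """In-memory stand-in for one MapReduce job: map, shuffle by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    rooted, unrooted = [], []          # the two outputs every stage dumps
    for key in sorted(groups):
        reducer(key, groups[key], rooted, unrooted)
    return rooted, unrooted


def map_identify(record):              # Stage 0: echo to all edges
    node, edge, _ = record
    yield node, ("REC", edge)
    if edge is not None:
        yield edge, ("ECHO", None)


def reduce_identify(node, values, rooted, unrooted):
    got_echo = any(tag == "ECHO" for tag, _ in values)
    for tag, edge in values:
        if tag == "REC":               # no echo received means this node is a root
            (unrooted if got_echo else rooted).append(
                (node, edge, None if got_echo else node))


def map_rooting(record):               # Stage 1+: join newly rooted with un-rooted
    node, edge, root = record
    if root is None:
        yield node, ("REC", edge)      # un-rooted record waits at its own key
    elif edge is not None:
        yield edge, ("ROOT", root)     # newly rooted record echoes down its edge


def reduce_rooting(node, values, rooted, unrooted):
    roots = sorted(value for tag, value in values if tag == "ROOT")
    root = roots[0] if roots else None
    for tag, edge in values:
        if tag == "REC":
            (rooted if root is not None else unrooted).append((node, edge, root))


def root_all(records):
    newly, unrooted = run_job(records, map_identify, reduce_identify)
    old_rooted = []
    while newly:                       # one full job per level of the tree
        old_rooted.extend(newly)       # newly rooted becomes "old rooted"
        newly, unrooted = run_job(newly + unrooted, map_rooting, reduce_rooting)
    return old_rooted, unrooted


records = [("101", None, None), ("101", "201", None), ("101", "202", None),
           ("201", None, None), ("202", "301", None), ("301", None, None)]
rooted, unrooted = root_all(records)
print(rooted, unrooted)
```

Note that the loop launches one job per tree level, and each job re-reads every still un-rooted record, which is the cost the staged diagram implies.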

Page 16: Tree and Graph Processing  On Hadoop

16

Map Reduce

• Great for large batch operations
• No memory limit
• Not good at iterations

Page 17: Tree and Graph Processing  On Hadoop

17

HBase

• One of the largest and most used NoSQL implementations in the world
• Based on the Google 2006 BigTable white paper
• Imagine it like a giant HashMap with keys and values
• Handles on the order of 100k operations a second, even on a small 10-node cluster

Page 18: Tree and Graph Processing  On Hadoop

18

HBase Getting

[Diagram: a Client, the HBase Master, and three HBase Region Servers, each with a Block Cache.]

Page 19: Tree and Graph Processing  On Hadoop

19

HBase Putting

[Diagram: a Client, the HBase Master, and three HBase Region Servers, each with a WAL, a MemStore, and an HFile.]

Page 20: Tree and Graph Processing  On Hadoop

20

HBase

• Good for graph traversing
• Bad for large batch processing

• Scan rate about 8x slower than HDFS
• Good for the end of a long tail
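
One way to read "good for graph traversing" is that rooting a single vertex becomes a chain of point lookups instead of a full batch pass. The sketch below uses a plain Python dict as a stand-in for an HBase table (following the "giant HashMap" analogy); it does not use any HBase client API. The row layout (row key = vertex ID, value = parent ID) and the name `find_root` are assumptions made for illustration.

```python
def find_root(table, vertex):
    """Walk parent pointers with one point lookup per hop until a root is reached."""
    seen = set()
    current = vertex
    while True:
        if current in seen:
            return None                 # circular link: no true root on this path
        seen.add(current)
        parent = table.get(current)     # stands in for a single-row get
        if parent is None:
            return current              # no parent stored, so this is the root
        current = parent


# Hypothetical table contents for the example tree: vertex -> parent
table = {"101": None, "201": "101", "202": "101", "301": "202"}
print(find_root(table, "301"))
```

The cost is proportional to the depth of the vertex rather than the size of the data set, which fits the case where only a few stragglers remain un-rooted.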

Page 21: Tree and Graph Processing  On Hadoop

21

Giraph

• System built for large batch graph processing
• Based on the Pregel 2009 white paper
• Hardened by LinkedIn and Facebook
• Reported to handle up to a trillion edges

Page 22: Tree and Graph Processing  On Hadoop

22

Giraph Loading

[Diagram: three sets of Data Blocks are loaded by four Workers, coordinated by a Master.]

Page 23: Tree and Graph Processing  On Hadoop

23

Giraph (Bulk Synchronous Parallel)

[Diagram: three Workers, each cycling through local vertex computing, communication, and barrier synchronization.]
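
The bulk synchronous parallel cycle can be sketched for the tree rooting problem in plain Python. This is a single-process illustration under assumptions, not Giraph code and not the Giraph API: each loop iteration plays the role of one superstep, swapping the message boxes stands in for the barrier, and a vertex with multiple parents takes the smallest root ID it receives. The name `bsp_root` is hypothetical.

```python
from collections import defaultdict


def bsp_root(graph):
    """Root every vertex using supersteps of compute, communicate, barrier."""
    root_of = {vertex: None for vertex in graph}
    inbox = {}
    superstep = 0
    while True:
        outbox = defaultdict(list)
        for vertex, targets in graph.items():      # local vertex computing
            messages = inbox.get(vertex, [])
            if superstep == 0:
                send = "ECHO"                      # everyone echoes to its edges
            elif superstep == 1 and not messages:
                root_of[vertex] = vertex           # heard no echo: this is a root
                send = vertex
            elif superstep > 1 and messages and root_of[vertex] is None:
                root_of[vertex] = min(messages)    # adopt the root that reached us
                send = root_of[vertex]
            else:
                continue                           # nothing to do this superstep
            for target in targets:                 # communication
                outbox[target].append(send)
        if superstep > 0 and not outbox:
            break                                  # no messages in flight: halt
        inbox = outbox                             # barrier synchronization
        superstep += 1
    return root_of


graph = {"101": ["201", "202"], "201": [], "202": ["301"], "301": []}
print(bsp_root(graph))
```

Unlike the staged MapReduce approach, the graph stays loaded between supersteps and only messages move, which is the main appeal of this model for iterative work.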

Page 24: Tree and Graph Processing  On Hadoop

24

Giraph

• Most mature bulk graph processing out there
• Of all the solutions, most graph focused

Page 25: Tree and Graph Processing  On Hadoop

25

Spark

• At Berkeley, around 2011, some asked if we could do better than MR
• Take advantage of lower cost memory
• Building on everything before

Page 26: Tree and Graph Processing  On Hadoop

26

Spark

[Diagram: RDD Objects (Rdd1.join(rdd2).groupBy(…).filter(…)) → DAG Scheduler (like a query planner) → Task Scheduler → Cluster Manager → Spark Workers, each with Task Threads and a Block Manager.]

Page 27: Tree and Graph Processing  On Hadoop

27

Spark

• Implementations
  • Onion MR approach with basic Spark
  • Pregel approach with Bagel or GraphX

• Bagel is a façade over generic Spark functionality
• GraphX is an effort to extend Spark

• Less code
• Learning curve
• It's raw and will be changing a lot in the next year