GraphLab Tutorial

Transcript
Page 1: GraphLab Tutorial

Carnegie Mellon University

GraphLab Tutorial
Yucheng Low


Page 2: GraphLab Tutorial

GraphLab Team

Yucheng Low
Aapo Kyrola
Jay Gu
Joseph Gonzalez
Danny Bickson

Carlos Guestrin

Page 3: GraphLab Tutorial

Development History

GraphLab 0.5 (2010): Internal Experimental Code
• Insanely Templatized

GraphLab 1 (2011)
• Nearly Everything is Templatized
• First Open Source Release (LGPL before June 2011; APL from June 2011 onward)

GraphLab 2 (2012)
• Many Things are Templatized
• Shared Memory: Jan 2012; Distributed: May 2012

Page 4: GraphLab Tutorial

GraphLab 2 Technical Design Goals
• Improved usability
• Decreased compile time
• As good or better performance than GraphLab 1
• Improved distributed scalability

… other abstraction changes … (come to the talk!)

Page 5: GraphLab Tutorial

Development History
Ever since GraphLab 1.0, all active development has been open source (APL):

code.google.com/p/graphlabapi/

(Even current experimental code, activated with a --experimental flag on ./configure)

Page 6: GraphLab Tutorial

Guaranteed Target Platforms
• Any x86 Linux system with gcc >= 4.2
• Any x86 Mac system with gcc 4.2.1 (OS X 10.5??)

• Other platforms?

… We welcome contributors.

Page 7: GraphLab Tutorial

Tutorial Outline
• GraphLab in a few slides + PageRank
• Checking out GraphLab v2
• Implementing PageRank in GraphLab v2
• Overview of different GraphLab schedulers
• Preview of Distributed GraphLab v2 (may not work in your checkout!)
• Ongoing work… (as much as time allows)

Page 8: GraphLab Tutorial

Warning
A preview of code still in intensive development!

Things may or may not work for you!

Interface may still change!

GraphLab 2 still has a number of performance regressions relative to GraphLab 1 that we are ironing out.

Page 9: GraphLab Tutorial

PageRank Example
Iterate:

R[i] = α + (1 − α) Σ_{j ∈ N[i]} R[j] / L[j]

Where:
• α is the random reset probability
• L[j] is the number of links on page j

(Figure: example link graph over six numbered pages)

Page 10: GraphLab Tutorial

The GraphLab Framework
• Graph-Based Data Representation
• Update Functions (User Computation)
• Scheduler
• Consistency Model

Page 11: GraphLab Tutorial

Data Graph
A graph with arbitrary data (C++ objects) associated with each vertex and edge.

Vertex Data:
• Webpage
• Webpage Features

Edge Data:
• Link weight

Graph:
• Link graph
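To make this concrete, here is a minimal sketch of attaching C++ objects to a graph. It assumes the GraphLab v2 shared-memory header (graphlab.hpp) and the graphlab::graph<VertexData, EdgeData> template; treat the exact call names (add_vertex, add_edge, finalize) as assumptions to check against your checkout.

#include <graphlab.hpp>

struct vertex_data {
  double rank;                        // e.g. the page's current rank
  vertex_data() : rank(0.0) { }
};

struct edge_data {
  double weight;                      // e.g. the link weight W_ji
  edge_data(double w = 1.0) : weight(w) { }
};

typedef graphlab::graph<vertex_data, edge_data> graph_type;

int main() {
  graph_type graph;
  graphlab::vertex_id_type a = graph.add_vertex(vertex_data());
  graphlab::vertex_id_type b = graph.add_vertex(vertex_data());
  graph.add_edge(a, b, edge_data(1.0));  // directed link a -> b
  graph.finalize();                      // freeze structure before running
  return 0;
}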

Page 12: GraphLab Tutorial

The GraphLab Framework
• Graph-Based Data Representation
• Update Functions (User Computation)
• Scheduler
• Consistency Model

Page 13: GraphLab Tutorial

Update Functions

An update function is a user-defined program which, when applied to a vertex, transforms the data in the scope of the vertex.

pagerank(i, scope) {
  // Get neighborhood data (R[i], W_ji, R[j]) from scope

  // Update the vertex data
  R[i] ← α + (1 − α) Σ_{j ∈ N[i]} W_ji R[j];

  // Reschedule neighbors if needed
  if R[i] changes then reschedule_neighbors_of(i);
}

Page 14: GraphLab Tutorial

Dynamic Schedule

(Figure: a shared scheduler queue of vertex tasks, e.g. a, h, b, i, drawn in parallel by CPU 1 and CPU 2 from a graph with vertices a through k)

Process repeats until scheduler is empty
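Conceptually, each CPU runs a loop like the following (a sketch of the execution model only, not GraphLab's actual scheduler internals; pop_next, update_function, and scope_of are hypothetical names):

// What each worker does, conceptually: pop a scheduled vertex,
// run the update function on it, repeat until the queue drains.
while (!scheduler.empty()) {
  vertex_id_type v = scheduler.pop_next();  // hypothetical scheduler call
  update_function(v, scope_of(v));          // may reschedule neighbors
}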

Page 15: GraphLab Tutorial

Source Code Interjection 1

Graph, update functions, and schedulers
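Before diving into the source, a rough sketch of how the graph, update functor, and scheduler plug together in a shared-memory program. The graphlab::core calls below (set_scheduler_type, set_scope_type, schedule_all, start) follow the v1/v2 shared-memory interface as I recall it; verify every name against your checkout. graph_type and the pagerank functor are as defined on the surrounding slides, and load_graph is a hypothetical helper.

#include <iostream>
#include <graphlab.hpp>

int main() {
  graphlab::core<graph_type, pagerank> core;
  load_graph(core.graph());         // hypothetical helper that adds
                                    // the vertices and edges

  core.set_scheduler_type("fifo");  // pick a dynamic scheduler
  core.set_scope_type("edge");      // pick a consistency model

  core.schedule_all(pagerank());    // enqueue every vertex
  double runtime = core.start();    // run until the scheduler is empty
  std::cout << "Finished in " << runtime << " seconds." << std::endl;
  return 0;
}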

Page 16: GraphLab Tutorial

--scope=vertex
--scope=edge

Page 17: GraphLab Tutorial

Consistency Trade-off

Consistency vs. "Throughput" (# of "iterations" per second)

But the goal of an ML algorithm is to converge; raw iterations per second is the wrong metric.

A False Trade-off

Page 18: GraphLab Tutorial

Ensuring Race-Free Code
How much can computation overlap?

Page 19: GraphLab Tutorial

The GraphLab Framework
• Graph-Based Data Representation
• Update Functions (User Computation)
• Scheduler
• Consistency Model

Page 20: GraphLab Tutorial

Importance of Consistency
Fast ML algorithm development cycle:

Build → Test → Debug → Tweak Model → (repeat)

Consistency is necessary for the framework to behave predictably and to avoid problems caused by non-determinism; otherwise you cannot tell whether the execution is wrong or the model is wrong.

Page 21: GraphLab Tutorial

Full Consistency

Guaranteed safety for all update functions

Page 22: GraphLab Tutorial

Full Consistency

Parallel updates are only allowed on vertices at least two hops apart, which reduces opportunities for parallelism.

Page 23: GraphLab Tutorial

Obtaining More Parallelism

Not all update functions will modify the entire scope!

Belief Propagation: only uses edge data
Gibbs Sampling: only needs to read adjacent vertices

Page 24: GraphLab Tutorial

Edge Consistency

Page 25: GraphLab Tutorial

Obtaining More Parallelism

"Map" operations, e.g. feature extraction on vertex data.

Page 26: GraphLab Tutorial

Vertex Consistency

Page 27: GraphLab Tutorial

The GraphLab Framework
• Graph-Based Data Representation
• Update Functions (User Computation)
• Scheduler
• Consistency Model

Page 28: GraphLab Tutorial

Shared Variables
• Global aggregation through the Sync Operation
• A global parallel reduction over the graph data
• Synced variables are recomputed at defined intervals while update functions are running

Sync: Highest PageRank
Sync: Log-likelihood
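As an illustration of what a sync like "Highest PageRank" reduces, here is the same computation written as a plain sequential fold over the vertices. This is illustrative only: the real sync API registers the reduction with the engine, which recomputes it in parallel at the configured interval, and num_vertices/vertex_data are assumed graph accessors.

#include <algorithm>

// The fold a "Sync: Highest PageRank" performs, written sequentially.
double highest_pagerank(const graph_type& graph) {
  double best = 0;
  for (graphlab::vertex_id_type v = 0; v < graph.num_vertices(); ++v)
    best = std::max(best, graph.vertex_data(v).rank);
  return best;  // the engine would expose this as a shared variable
}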

Page 29: GraphLab Tutorial

Source Code Interjection 2

Shared variables

Page 30: GraphLab Tutorial

What can we do with these primitives?

…many many things…

Page 31: GraphLab Tutorial

Matrix Factorization
Netflix Collaborative Filtering

Alternating Least Squares Matrix Factorization

Model: 0.5 million nodes, 99 million edges

(Figure: bipartite Netflix graph of Users and Movies, with latent factor dimension d)
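For background, the objective ALS minimizes here (the standard formulation; the slide itself does not spell it out): each user u and each movie m gets a d-dimensional latent vector, fit by alternating regularized least-squares solves over the observed ratings E:

min_{U,V} Σ_{(u,m) ∈ E} (r_um − u_uᵀ v_m)² + λ ( Σ_u ‖u_u‖² + Σ_m ‖v_m‖² )

Fixing the movie vectors makes each user's subproblem an ordinary least-squares solve over its adjacent edges (and vice versa), which is why one ALS step maps naturally onto a GraphLab update function on this bipartite graph.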

Page 32: GraphLab Tutorial

Netflix Speedup
(Figure: speedup curves as the size of the matrix factorization increases)

Page 33: GraphLab Tutorial

Video Co-Segmentation
Discover "coherent" segment types across a video (extends Batra et al. '10)

1. Form super-voxels from the video
2. EM & inference in a Markov random field

Large model: 23 million nodes, 390 million edges

(Figure: scaling plot comparing GraphLab against ideal speedup)

Page 34: GraphLab Tutorial

Many More
• Tensor Factorization
• Bayesian Matrix Factorization
• Graphical Model Inference/Learning
• Linear SVM
• EM clustering
• Linear Solvers using GaBP
• SVD
• Etc.

Page 35: GraphLab Tutorial

Distributed Preview

Page 36: GraphLab Tutorial

GraphLab 2 Abstraction Changes
(an overview of a couple of them)

(Come to the talk for the rest!)

Page 37: GraphLab Tutorial

Exploiting Update Functors

(for the greater good)

Page 38: GraphLab Tutorial

Exploiting Update Functors (for the greater good)

1. Update functors store state.
2. The scheduler schedules update functor instances.
3. We can use update functors as a form of controlled asynchronous message passing to communicate between vertices!

Page 39: GraphLab Tutorial

Delta-Based Update Functors

struct pagerank : public iupdate_functor<graph, pagerank> {
  double delta;
  pagerank(double d) : delta(d) { }
  void operator+=(pagerank& other) { delta += other.delta; }
  void operator()(icontext_type& context) {
    vertex_data& vdata = context.vertex_data();
    vdata.rank += delta;
    if (std::fabs(delta) > EPSILON) {
      // Forward a (1 - RESET_PROB) fraction of the delta, split
      // evenly across this vertex's out-edges.
      double out_delta = delta * (1 - RESET_PROB) /
                         context.num_out_edges();
      context.schedule_out_neighbors(pagerank(out_delta));
    }
  }
};
// Initial Rank:     R[i] = 0;
// Initial Schedule: pagerank(RESET_PROB);
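Why this converges to the right answer (a standard push-style PageRank argument, not stated on the slide): every vertex first absorbs α = RESET_PROB into its rank, and any absorbed delta is forwarded scaled by (1 − α) and split over the out-edges. The accumulated ranks therefore sum the geometric series

R[i] = α Σ_{k≥0} [((1 − α) Wᵀ)^k 1]_i

whose limit is exactly the fixed point of the equation from page 13, R[i] = α + (1 − α) Σ_{j ∈ N[i]} W_ji R[j].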

Page 40: GraphLab Tutorial

Asynchronous Message Passing
Obviously not all computation can be written this way, but when it can, it can be extremely fast.

Page 41: GraphLab Tutorial

Factorized Updates

Page 42: GraphLab Tutorial

PageRank in GraphLab

struct pagerank : public iupdate_functor<graph, pagerank> {
  void operator()(icontext_type& context) {
    vertex_data& vdata = context.vertex_data();
    double sum = 0;
    foreach(edge_type edge, context.in_edges())
      sum += context.const_edge_data(edge).weight *
             context.const_vertex_data(edge.source()).rank;
    double old_rank = vdata.rank;
    vdata.rank = RESET_PROB + (1 - RESET_PROB) * sum;
    double residual = std::fabs(vdata.rank - old_rank) /
                      context.num_out_edges();
    if (residual > EPSILON)
      context.reschedule_out_neighbors(pagerank());
  }
};

Page 43: GraphLab Tutorial

PageRank in GraphLab

(The same code as the previous slide, annotated with its three phases:)
• Parallel "Sum" Gather: the foreach loop over context.in_edges()
• Atomic Single Vertex Apply: the vdata.rank update
• Parallel Scatter [Reschedule]: context.reschedule_out_neighbors(pagerank())

Page 44: GraphLab Tutorial

Decomposable Update Functors

Decompose update functions into 3 phases:

• Gather (user-defined): a parallel reduction over the scope of vertex Y, merging per-edge results Δ1 + Δ2 + … into one accumulated Δ
• Apply (user-defined): apply the accumulated Δ to the center vertex Y
• Scatter (user-defined): update adjacent edges and vertices

Page 45: GraphLab Tutorial

Factorized PageRank

struct pagerank : public iupdate_functor<graph, pagerank> {
  double accum, residual;
  pagerank() : accum(0), residual(0) { }
  void gather(icontext_type& context, const edge_type& edge) {
    accum += context.const_edge_data(edge).weight *
             context.const_vertex_data(edge.source()).rank;
  }
  void merge(const pagerank& other) { accum += other.accum; }
  void apply(icontext_type& context) {
    vertex_data& vdata = context.vertex_data();
    double old_value = vdata.rank;
    vdata.rank = RESET_PROB + (1 - RESET_PROB) * accum;
    residual = fabs(vdata.rank - old_value) / context.num_out_edges();
  }
  void scatter(icontext_type& context, const edge_type& edge) {
    if (residual > EPSILON)
      context.schedule(edge.target(), pagerank());
  }
};

Page 46: GraphLab Tutorial

Demo of *everything*

PageRank

Page 47: GraphLab Tutorial

Ongoing Work
Extensions to improve performance on large graphs (see the GraphLab talk later!!):
• Better distributed graph representation methods
• Possibly better graph partitioning
• Off-core graph storage
• Continually changing graphs

All-new rewrite of distributed GraphLab (come back in May!)
