
Giraph: Production-grade graph processing infrastructure for trillion edge graphs

8/12/2014 ATPESC

Avery Ching

Motivation

Apache Giraph

• Inspired by Google’s Pregel but runs on Hadoop

• “Think like a vertex”

• Maximum value vertex example (sketched in code below)

[Figure: maximum value example. Two processors exchange vertex values (initially 5, 2, and 1) over successive supersteps until every vertex holds the maximum value, 5.]
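
A minimal sketch of this maximum value computation in Giraph's vertex-centric API, assuming long vertex values and no edge values; the class name and type parameters are illustrative choices, not taken from the slides.

import java.io.IOException;

import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;

public class MaxValueComputation extends BasicComputation<
    LongWritable, LongWritable, NullWritable, LongWritable> {
  @Override
  public void compute(Vertex<LongWritable, LongWritable, NullWritable> vertex,
      Iterable<LongWritable> messages) throws IOException {
    boolean changed = getSuperstep() == 0;  // always announce the initial value
    for (LongWritable message : messages) {
      if (message.get() > vertex.getValue().get()) {
        vertex.getValue().set(message.get());  // adopt a larger value
        changed = true;
      }
    }
    if (changed) {
      // Tell every neighbor about the largest value seen so far.
      sendMessageToAllEdges(vertex, new LongWritable(vertex.getValue().get()));
    }
    vertex.voteToHalt();  // woken up again only if a new message arrives
  }
}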

Giraph on Hadoop / YARN

[Figure: Giraph runs on top of both MapReduce and YARN, across Hadoop 0.20.x, 0.20.203, 1.x, and 2.0.x]

Send page rank value to neighbors for 30 iterations

Calculate updated page rank value from neighbors

Page rank in Giraph

import java.io.IOException;

import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;

public class PageRankComputation extends BasicComputation<
    LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {
  @Override
  public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
      Iterable<DoubleWritable> messages) throws IOException {
    if (getSuperstep() >= 1) {
      double sum = 0;
      for (DoubleWritable message : messages) {
        sum += message.get();
      }
      vertex.getValue().set((0.15d / getTotalNumVertices()) + 0.85d * sum);
    }
    if (getSuperstep() < 30) {
      sendMessageToAllEdges(vertex,
          new DoubleWritable(vertex.getValue().get() / vertex.getNumEdges()));
    } else {
      vertex.voteToHalt();
    }
  }
}

Giraph compared to MPI

Giraph’s vertex-centric API is higher-level (and narrower) than MPI’s

• Vertex message queues vs process-level messaging

• Enforced BSP model

• Data distribution, checkpointing handled by Giraph

• Giraph aggregators fill the role of user-level MPI_Allreduce reductions (see the sketch after this list)

Java (Giraph) vs C/C++ (MPI)

Scheduled with Hadoop clusters

Easy integration with other Hadoop pipelines
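
A minimal sketch of that aggregator pattern, assuming Giraph's stock LongSumAggregator and a hypothetical "degree.sum" aggregator name: the master registers the aggregator before the first superstep, every vertex contributes its out-degree, and the reduced sum is readable by all vertices (and the master) in the following superstep.

import java.io.IOException;

import org.apache.giraph.aggregators.LongSumAggregator;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.giraph.master.DefaultMasterCompute;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;

/** Master side: register the aggregator before superstep 0. */
public class DegreeSumMaster extends DefaultMasterCompute {
  @Override
  public void initialize() throws InstantiationException, IllegalAccessException {
    registerAggregator("degree.sum", LongSumAggregator.class);
  }
}

/** Worker side: contribute a value, then read the global reduction. */
class DegreeSumComputation extends BasicComputation<
    LongWritable, LongWritable, NullWritable, LongWritable> {
  @Override
  public void compute(Vertex<LongWritable, LongWritable, NullWritable> vertex,
      Iterable<LongWritable> messages) throws IOException {
    if (getSuperstep() == 0) {
      // Superstep 0: every vertex contributes its out-degree.
      aggregate("degree.sum", new LongWritable(vertex.getNumEdges()));
    } else {
      // Superstep 1: the reduced global sum is now visible everywhere.
      LongWritable totalDegree = getAggregatedValue("degree.sum");
      vertex.getValue().set(totalDegree.get());  // store the global edge count
      vertex.voteToHalt();
    }
  }
}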

Apache Giraph data flow

Loading the graph

[Figure: the input format divides the graph into splits (Split 0 to Split 3); the master assigns splits to Worker 0 and Worker 1, and each worker loads its splits and sends graph data to the worker that owns it]

Storing the graph

[Figure: Worker 0 and Worker 1 write their partitions (Part 0 to Part 3) through the output format]

Compute / Iterate

[Figure: Worker 0 and Worker 1 run compute over the in-memory graph partitions (Part 0 to Part 3) and send messages to each other; workers then send stats to the master and iterate to the next superstep]

Pipelined computation

Master “computes”

• Sets the computation, in/out message types, and combiner for the next superstep (see the sketch below)

• Can set/modify aggregator values

[Figure: over time, the master steps Worker 0 and Worker 1 together through phases 1a, 1b, 2, and 3]
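
A minimal sketch of such a master, assuming hypothetical Phase1aComputation, Phase1bComputation, Phase2Computation, and Phase3Computation classes; each superstep the master picks the computation class the workers will run next.

import org.apache.giraph.master.DefaultMasterCompute;

public class PhasedMasterCompute extends DefaultMasterCompute {
  @Override
  public void compute() {
    // Choose the vertex computation for the upcoming superstep.
    if (getSuperstep() == 0) {
      setComputation(Phase1aComputation.class);
    } else if (getSuperstep() == 1) {
      setComputation(Phase1bComputation.class);
    } else if (getSuperstep() == 2) {
      setComputation(Phase2Computation.class);
    } else {
      setComputation(Phase3Computation.class);
    }
  }
}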

Use case

Affinity propagation

Frey and Dueck, “Clustering by passing messages between data points”, Science 2007

Organically discover exemplars based on similarity

[Figure: clustering progression from initialization through intermediate stages to convergence]

Responsibility r(i,k)

• How well suited is k to be an exemplar for i?

Availability a(i,k)

• How appropriate is it for point i to choose point k as an exemplar, given all of k’s responsibilities?

Update exemplars

• Based on known responsibilities/availabilities, which vertex should be my exemplar?


* Dampen responsibility, availability

3 stages

Responsibility

Every vertex i with an edge to k maintains the responsibility of k for i

Sends the responsibility to k in a ResponsibilityMessage (sender id, responsibility(i,k)); a sketch of such a message type follows the figure

[Figure: vertices B, C, and D send responsibility messages r(b,a), r(c,a), and r(d,a) to vertex A, and B also sends r(b,d) to D]
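
A minimal sketch of a (sender id, value) message like the ResponsibilityMessage described above, written as a plain Hadoop Writable; the field names and accessors are assumptions, not taken from the actual implementation.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class ResponsibilityMessage implements Writable {
  private long senderId;        // the vertex i that computed r(i,k)
  private double responsibility;

  public ResponsibilityMessage() { }  // no-arg constructor for deserialization

  public ResponsibilityMessage(long senderId, double responsibility) {
    this.senderId = senderId;
    this.responsibility = responsibility;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeLong(senderId);
    out.writeDouble(responsibility);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    senderId = in.readLong();
    responsibility = in.readDouble();
  }

  public long getSenderId() { return senderId; }

  public double getResponsibility() { return responsibility; }
}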

Availability

Vertex sums positive messages

Sends availability to i in AvailabilityMessage (senderid, availability(i,k))

[Figure: vertex A sends availability messages a(b,a), a(c,a), and a(d,a) back to B, C, and D, and D sends a(b,d) to B]

Update exemplars

Dampens availabilities and scans edges to find exemplar k

Updates self-exemplar

[Figure: each vertex updates its exemplar; three of the vertices choose a as their exemplar and one chooses d]

Master logic

[Figure: master state machine. From the initial state, cycle through calculate responsibility, calculate availability, and update exemplars, then either halt or repeat.]

if (exemplars agree they are exemplars && changed exemplars < ∆) then halt, otherwise continue (see the sketch below)
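
A minimal sketch of that halt test on the master side, assuming the application sums the number of vertices whose exemplar changed during an update pass into a hypothetical "changed.exemplars" LongSumAggregator (the exemplar-agreement check is left out for brevity).

import org.apache.giraph.aggregators.LongSumAggregator;
import org.apache.giraph.master.DefaultMasterCompute;
import org.apache.hadoop.io.LongWritable;

public class AffinityMasterCompute extends DefaultMasterCompute {
  private static final long DELTA = 1000;  // illustrative threshold

  @Override
  public void initialize() throws InstantiationException, IllegalAccessException {
    registerAggregator("changed.exemplars", LongSumAggregator.class);
  }

  @Override
  public void compute() {
    long changed = this.<LongWritable>getAggregatedValue("changed.exemplars").get();
    // After the first update-exemplars pass, stop once few exemplars change.
    if (getSuperstep() > 0 && changed < DELTA) {
      haltComputation();
    }
  }
}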

Performance & Scalability

Example graph sizes

Twitter 255M MAU (https://about.twitter.com/company), 208 average followers (Beevolve 2012)

→ Estimated >53B edges

Facebook 1.28B MAU (Q1/2014 report), 200+ average friends (2011 S1)

→ Estimated >256B edges

Graphs used in research publications

[Chart: graphs used in research publications, in billions of edges (0 to 7): ClueWeb09, Twitter dataset, Friendster, Yahoo! web]

Rough social network scale*

[Chart: rough social network scale*, in billions of edges (0 to 300): Twitter est.*, Facebook est.*]

Faster than Hive?

Application | Graph size | CPU time speedup | Elapsed time speedup
Page rank (single iteration) | 400B+ edges | 26x | 120x
Friends of friends score | 71B+ edges | 12.5x | 48x

Apache Giraph scalability

Scalability of workers (200B edges)

[Chart: seconds (0 to 500) vs. number of workers (50 to 300), Giraph vs. ideal]

Scalability of edges (50 workers)

[Chart: seconds (0 to 500) vs. number of edges (1E+09 to 2E+11), Giraph vs. ideal]

Trillion social edges page rank

[Chart: minutes per iteration (0 to 4) on a trillion-edge social graph, 6/30/2013 vs. 6/2/2014]

Improvements

• GIRAPH-840 - Netty 4 upgrade

• G1 Collector / tuning

Graph partitioning

Why balanced partitioning

Random partitioning == good balance

BUT ignores entity affinity

[Figure: example graph of 12 vertices (0 to 11) assigned to partitions at random, ignoring which vertices are connected]

Balanced partitioning application

Results from one service: cache hit rate grew from 70% to 85%, and bandwidth was cut in half

[Figure: the same 12 vertices partitioned by affinity, e.g. {0, 2, 3, 5, 6, 9, 11} and {1, 4, 7, 8, 10}]

Balanced label propagation results

* Loosely based on Ugander and Backstrom. Balanced label propagation for partitioning massive graphs, WSDM '13

Leveraging partitioning

Explicit remapping

Native remapping

• Transparent

• Embedded

Explicit remapping

Original graph:

Id       | Edges
San Jose | (Chicago, 4), (New York, 6)
Chicago  | (San Jose, 4), (New York, 3)
New York | (San Jose, 6), (Chicago, 3)

Partitioning mapping:

Id       | Alt Id
San Jose | 0
Chicago  | 1
New York | 2

Remapped graph:

Id | Edges
0  | (1, 4), (2, 6)
1  | (0, 4), (2, 3)
2  | (0, 6), (1, 3)

Compute shortest paths from 0, giving the compute output:

Id | Distance
0  | 0
1  | 4
2  | 6

Join with the reverse partition mapping:

Alt Id | Id
0      | San Jose
1      | Chicago
2      | New York

Final compute output:

Id       | Distance
San Jose | 0
Chicago  | 4
New York | 6

Native transparent remapping

Original graph:

Id       | Edges
San Jose | (Chicago, 4), (New York, 6)
Chicago  | (San Jose, 4), (New York, 3)
New York | (San Jose, 6), (Chicago, 3)

Partitioning mapping:

Id       | Group
San Jose | 0
Chicago  | 1
New York | 2

Original graph with group information:

Id       | Group | Edges
San Jose | 0     | (Chicago, 4), (New York, 6)
Chicago  | 1     | (San Jose, 4), (New York, 3)
New York | 2     | (San Jose, 6), (Chicago, 3)

Compute shortest paths from “San Jose”, giving the final compute output:

Id       | Distance
San Jose | 0
Chicago  | 4
New York | 6

Native embedded remapping

Original graph:

Id | Edges
0  | (1, 4), (2, 6)
1  | (0, 4), (2, 3)
2  | (0, 6), (1, 3)

Partitioning mapping:

Id | Mach
0  | 0
1  | 1
2  | 0

Original graph with the mapping embedded in the Id (top bits hold the machine):

Mach, Id | Edges
0, 0     | (1, 4), (2, 6)
1, 1     | (0, 4), (2, 3)
0, 2     | (0, 6), (1, 3)

Compute shortest paths from “San Jose” (id 0), giving the final compute output:

Id | Distance
0  | 0
1  | 4
2  | 6

Not all graphs can leverage this technique; Facebook can, since its ids are longs with unused bits (see the sketch below).
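
A minimal sketch of packing a machine hint into the top bits of a long vertex id; the 16/48 bit split and the helper names are assumptions for illustration, not Facebook's actual layout.

public final class EmbeddedIds {
  private static final int MACHINE_BITS = 16;
  private static final int ID_BITS = Long.SIZE - MACHINE_BITS;  // 48 bits left for the id
  private static final long ID_MASK = (1L << ID_BITS) - 1;

  private EmbeddedIds() { }

  /** Pack a machine number into the unused top bits of an id. */
  public static long embed(long machine, long originalId) {
    return (machine << ID_BITS) | (originalId & ID_MASK);
  }

  /** Which machine/partition an embedded id is mapped to. */
  public static long machineOf(long embeddedId) {
    return embeddedId >>> ID_BITS;
  }

  /** Recover the original id. */
  public static long originalIdOf(long embeddedId) {
    return embeddedId & ID_MASK;
  }
}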

Remapping comparison

Explicit

• Pros: can also add id compression

• Cons: application aware of remapping; workflow complexity; pre- and post-join overhead

Native transparent

• Pros: no application change, just additional input parameters

• Cons: additional memory usage on input; group information uses more memory

Native embedded

• Pros: utilizes unused bits

• Cons: application changes the Id type

Partitioning experiments

345B edge page rank

[Chart: seconds per iteration (0 to 160) for random partitioning vs. 47% local vs. 60% local partitioning]

Message explosion

Avoiding out-of-core

Example: mutual friends calculation between neighbors

1. Send your friends a list of your friends

2. Intersect with your friend list


1.23B users (as of 1/2014) × 200+ average friends (2011 S1), each friend list of 200+ 8-byte ids (longs) sent to every friend

≈ 394 TB of messages; at 100 GB of memory per machine, that is roughly 3,940 machines (not including the graph itself)

[Figure: example graph with vertices A through E; each vertex receives its neighbors’ friend lists (e.g. A:{D}, D:{A,E}, E:{D}) and intersects them with its own friend list]

Superstep splitting

Send over only a subset of the source/destination edges per superstep

* Currently manual; making it automatic is future work (a sketch of the manual gating follows the figure)

[Figure: with sources and destinations each split into groups A and B, four supersteps cover the combinations: sources A to destinations A, sources A to destinations B, sources B to destinations A, and sources B to destinations B]
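
A minimal sketch of that manual gating, assuming a 2x2 split and bucketing vertices by id modulo 2 (an illustrative choice): in each superstep only the chosen source bucket sends, and only to targets in the chosen destination bucket, so roughly a quarter of the messages are in flight at once.

import java.io.IOException;

import org.apache.giraph.edge.Edge;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;

public class SplitSendComputation extends BasicComputation<
    LongWritable, LongWritable, NullWritable, LongWritable> {
  private static final int SPLITS = 2;

  @Override
  public void compute(Vertex<LongWritable, LongWritable, NullWritable> vertex,
      Iterable<LongWritable> messages) throws IOException {
    // (A real application would also process `messages` here.)
    if (getSuperstep() >= SPLITS * SPLITS) {
      vertex.voteToHalt();  // every (source, destination) pair has been covered
      return;
    }
    int sourceBucket = (int) (getSuperstep() / SPLITS);
    int destBucket = (int) (getSuperstep() % SPLITS);

    // Only the "on" source bucket sends in this superstep...
    if (vertex.getId().get() % SPLITS == sourceBucket) {
      for (Edge<LongWritable, NullWritable> edge : vertex.getEdges()) {
        // ...and only to targets in the "on" destination bucket.
        if (edge.getTargetVertexId().get() % SPLITS == destBucket) {
          sendMessage(edge.getTargetVertexId(), vertex.getValue());
        }
      }
    }
  }
}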

Giraph in production

Over 1.5 years in production

Hundreds of production Giraph jobs processed per week

• Lots of untracked experiments

30+ applications in our internal application repository

Sample production job - 700B+ edges

Job times range from minutes to hours

GiraphicJam demo

Giraph-related projects

Graft: The distributed Giraph debugger

Giraph roadmap

2/12 - 0.1, 5/13 - 1.0, 6/14 - 1.1

The future

Scheduling jobs

[Figure: two job timelines; snapshot automatically after a time period and restart at the end of the queue]

Democratize Giraph?

Higher-level primitives (e.g. HelP - Salihoglu)

• Filter

• Aggregating Neighbor Values (ANV)

• Local Update of Vertices (LUV)

• Update Vertices Using One Other Vertex (UVUOV): updates vertex values using a value from one other vertex (not necessarily a neighbor)

• Form Supervertices (FS)

• Aggregate Global Value (AGV)

Graph traversal language (e.g. Gremlin)

// calculate basic
// collaborative filtering for
// vertex 1
m = [:]
g.v(1).out('likes').in('likes').out('likes').groupCount(m)
m.sort{-it.value}

// calculate the primary
// eigenvector (eigenvector
// centrality) of a graph
m = [:]; c = 0;
g.V.as('x').out.groupCount(m).loop('x'){c++ < 1000}
m.sort{-it.value}

Implement lots of algorithms?

./run-page-rank -input pages -output page_rank_output

./run-mutual-friends -input friendlist -output pair_count_output

./run-graph-partitioning -input vertices_edges -output vertex_partition_list

Future work

Investigate alternative computing models:

• Giraph++ (IBM research)

• Giraphx (University at Buffalo, SUNY)

Performance

Applications

Our team


Maja Kabiljo

Sergey Edunov

Pavan Athivarapu

Avery Ching

Sambavi Muthukrishnan