Apache Giraph: Large-scale graph processing done better
Data Mining Class
Sapienza, University of Rome
A. Y. 2016 - 2017
Basic concepts Let’s start Get our hands dirty
Hi!
Simone Santacroce
https://it.linkedin.com/in/simone-santacroce-272739134
Manuel Coppotelli
https://it.linkedin.com/in/manuelcoppotelli
George Adrian Munteanu
https://it.linkedin.com/in/george-adrian-munteanu-707744134
Lorenzo Marconi
https://www.linkedin.com/in/lorenzo-marconi-1a2580105
Antonio La Torre
https://www.linkedin.com/in/antonio-la-torre-768738134
Lucio Burlini
https://www.linkedin.com/in/lucio-burlini-827739134
Agenda
1 Basic concepts
• Graphs in the real world
• Challenges on graphs
• MapReduce
• Giraph
2 Let’s start
• Out-Degree & In-Degree
3 Get our hands dirty
• Simple PageRank
Graphs 101
• Graph: representation of a set of objects, G = <V, E>
• Captures pairwise relationships between objects
• Can have directions, weights, ...
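To make the definition concrete, here is a minimal sketch of a directed, weighted graph G = <V, E> stored as adjacency lists. The `Graph` class and its method names are hypothetical, chosen only for illustration; this is plain Python, not Giraph code.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """A directed graph G = <V, E> stored as adjacency lists (illustrative only)."""
    # vertex id -> {neighbor id -> edge weight}
    adjacency: dict = field(default_factory=dict)

    def add_vertex(self, v):
        self.adjacency.setdefault(v, {})

    def add_edge(self, u, v, weight=1.0):
        # Adding an edge implicitly adds both endpoints to V.
        self.add_vertex(u)
        self.add_vertex(v)
        self.adjacency[u][v] = weight

    def vertices(self):
        return set(self.adjacency)

    def edges(self):
        return [(u, v, w)
                for u, nbrs in self.adjacency.items()
                for v, w in nbrs.items()]


g = Graph()
g.add_edge("a", "b", 2.5)   # directed edge a -> b with weight 2.5
g.add_edge("b", "c")        # directed edge b -> c with the default weight
```

An undirected graph could be represented the same way by adding each edge in both directions.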
A computer network
A road map
The web
Social networks
• Both physical and Internet-mediated
• Users are vertices
• Any kind of interaction generates edges
Graphs are huge!
∼ 50B pages
∼ 1.1B users
∼ 570M users
∼ 530M users
Graphs are nasty
• Graphs need processing
• Each vertex depends on its neighbors, recursively
• Recursive problems are nicely solved iteratively
So what?
Why not MapReduce? [1]
MapReduce is the current standard to manage big sets of data for intensive computing.
Repeat N times ...
[1] https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf
MapReduce Drawbacks
• Each job is executed N times
• Job bootstrap
• Mappers send both the values and the graph structure
• Extensive IO at input, shuffle & sort, output
Disk I/O and Job scheduling quickly dominate the algorithm
Google’s Pregel [2]
• Developed specifically for large-scale graph processing
• Intuitive API that lets you “think like a vertex”
• Bulk Synchronous Parallel (BSP) as execution model
• Fault tolerance by checkpointing
[2] https://www.cs.cmu.edu/~pavlo/courses/fall2013/static/papers/p135-malewicz.pdf
Giraph
The Story
Think like a vertex
• Each vertex has an id, a value, a list of adjacent neighbors and corresponding edge values
• Vertices implement algorithms by sending messages
• Messages are delivered at the start of each superstep
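The vertex-centric model above can be sketched in a few lines of plain Python. This is a toy, single-process simulation and not the Giraph API: `run_supersteps` and `propagate_max` are hypothetical names, and the example graph is invented. Each vertex only sees its own value, its neighbor list, and the messages delivered at the start of the superstep; the run ends when no vertex sends anything.

```python
def run_supersteps(values, neighbors, compute, max_supersteps=30):
    """Toy BSP loop (illustrative, not Giraph).

    values:    {vertex_id: value}, updated in place
    neighbors: {vertex_id: [neighbor ids]}
    compute:   function(superstep, vertex_id, value, messages, neighbor_ids)
               returning (new_value, {destination_id: message})
    Returns the number of supersteps executed.
    """
    inbox = {v: [] for v in values}
    for superstep in range(max_supersteps):
        outbox = {v: [] for v in values}
        sent = False
        for v in values:
            new_value, outgoing = compute(
                superstep, v, values[v], inbox[v], neighbors[v])
            values[v] = new_value
            for dest, msg in outgoing.items():
                outbox[dest].append(msg)
                sent = True
        # Synchronization barrier: messages sent in this superstep
        # become visible only at the start of the next one.
        inbox = outbox
        if not sent:
            return superstep + 1
    return max_supersteps


def propagate_max(superstep, vertex_id, value, messages, neighbor_ids):
    """Each vertex keeps the largest value it has seen so far."""
    new_value = max([value] + messages)
    # Send in superstep 0, and afterwards only when the value grew.
    if superstep == 0 or new_value > value:
        return new_value, {n: new_value for n in neighbor_ids}
    return new_value, {}


values = {1: 3, 2: 6, 3: 2, 4: 1}
neighbors = {1: [2], 2: [1, 4], 3: [2, 4], 4: [3]}
supersteps = run_supersteps(values, neighbors, propagate_max)
```

After the run every vertex holds the maximum value of the graph, although no vertex ever read another vertex's state directly.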
Bulk Synchronous Parallel (BSP)
• Master-Slave architecture
• Batch oriented processing
• Computation happens in-memory
Advantages
• No locks: message-based communication
• No semaphores: global synchronization
• Iteration isolation: massively parallelizable
Architecture
Single Map-only Job
Jobs Schema
Other things
Aggregators
• Mechanism for global communication and global computation
• Global value calculated in superstep t available in t + 1
• Pre-defined (e.g. sum, max, min) or user-definable functions [3]
Combiners
• User-defined function [3] applied to messages before they are sent or delivered
• Similar to Hadoop ones
• Saves on network or memory
Checkpointing
• Store work to disk at user-defined intervals (isn’t always evil)
• Restart on failure
[3] The function has to be both commutative and associative
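A small sketch of why the combining function has to be commutative and associative: messages for the same destination may be merged in any order and in any grouping, so the result must not depend on either. The helper `combine_messages` below is hypothetical plain Python, not a Giraph class, and the message values are made up.

```python
from collections import defaultdict
from functools import reduce


def combine_messages(messages, combine):
    """Fold all messages addressed to the same vertex into a single one.

    messages: iterable of (destination_id, value) pairs
    combine:  commutative and associative function of two values
    """
    grouped = defaultdict(list)
    for dest, value in messages:
        grouped[dest].append(value)
    return {dest: reduce(combine, vals) for dest, vals in grouped.items()}


outgoing = [(2, 0.5), (3, 0.5), (2, 1.0), (3, 0.25), (2, 0.25)]

# Sum combiner: five messages shrink to two, one per destination.
combined = combine_messages(outgoing, lambda a, b: a + b)

# Same messages in the opposite order give the same result.
combined_reversed = combine_messages(reversed(outgoing), lambda a, b: a + b)

# A max combiner works the same way.
combined_max = combine_messages(outgoing, max)
```

A function such as subtraction would give order-dependent results here, which is why it could not serve as a combiner.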
2 Let’s start
• Out-Degree & In-Degree
LongLongNullTextInputFormat
org.apache.giraph.io.formats.LongLongNullTextInputFormat
If there is an edge from Node 1 to Node 2, then Node 2 appears in the neighbor list of Node 1
<NODE1 ID> <SPACE> <NEIGHBOR1 ID> <SPACE> <NEIGHBOR2 ID> ...
<NODE2 ID> <SPACE> <NEIGHBOR1 ID> <SPACE> <NEIGHBOR2 ID> ...
...
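As a rough illustration of the Out-Degree & In-Degree exercise on this input format, the sketch below parses adjacency lines and computes both degrees. It is plain Python rather than a Giraph computation; the function names and the sample lines are assumptions made for the example. The in-degree part mimics two supersteps: every vertex sends one message per out-edge, then every vertex counts what it received.

```python
def parse_adjacency(lines):
    """Parse '<node id> <neighbor id> <neighbor id> ...' lines
    into {node id: [neighbor ids]}."""
    graph = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        node, *neighbors = (int(f) for f in fields)
        graph[node] = neighbors
    return graph


def out_degrees(graph):
    # The out-degree is simply the length of the neighbor list.
    return {v: len(nbrs) for v, nbrs in graph.items()}


def in_degrees(graph):
    # "Superstep 0": every vertex sends one message to each out-neighbor.
    received = {v: 0 for v in graph}
    for v, nbrs in graph.items():
        for n in nbrs:
            received[n] = received.get(n, 0) + 1
    # "Superstep 1": every vertex counts the messages it received.
    return received


sample = ["1 2 3", "2 3", "3 1"]
graph = parse_adjacency(sample)
out_deg = out_degrees(graph)
in_deg = in_degrees(graph)
```

The out-degree needs no communication at all, while the in-degree needs exactly one round of messages.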
IdWithValueTextOutputFormat
org.apache.giraph.io.formats.IdWithValueTextOutputFormat
For each node print the Node ID and the Node Value
<NODE1 ID> <TAB> <NODE1 VALUE>
<NODE2 ID> <TAB> <NODE2 VALUE>
...
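For reference, lines in this layout could be produced as in the short sketch below. The function name is hypothetical, and sorting by id is only a convenience for the example; it is not something the output format is stated to guarantee.

```python
def format_id_with_value(values):
    """Render {vertex id: vertex value} as '<id><TAB><value>' lines."""
    return [f"{vertex_id}\t{value}"
            for vertex_id, value in sorted(values.items())]


# Example values: an out-degree per vertex id.
lines = format_id_with_value({2: 1, 1: 2, 3: 1})
```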
Demo
Demo code
https://github.com/manuelcoppotelli/giraph-demo
3 Get our hands dirty
• Simple PageRank
Google’s PageRank [4]
• The success factor of Google’s search engine
• A graph algorithm computing the “importance” of webpages
◦ Important pages have a lot of links from other important pages
◦ Look at the structure of the underlying network
• Ability to conduct web-scale graph processing
[4] http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf
Simple PageRank
• Recursive definition
$$\mathrm{PageRank}_{i+1}(v) = \frac{1 - d}{N} + d \cdot \sum_{u \to v} \frac{\mathrm{PageRank}_i(u)}{O(u)}$$
• Where:
◦ d: damping factor; the fraction of the PageRank that must be transferred to the neighbors. Usually 0.85
◦ N: total number of pages
◦ O(u): out-degree; total number of links on page u
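The recursive definition can be iterated directly, as in the minimal sketch below. It is plain Python rather than Giraph code, and it rests on a few assumptions that are not in the slides: the function name, the three-page example graph, a start value of 1/N per page (so that the ranks sum to 1), a fixed number of iterations, and the requirement that every page has at least one outgoing link.

```python
def simple_pagerank(graph, d=0.85, iterations=50):
    """Iterate PageRank_{i+1}(v) = (1 - d)/N + d * sum_{u -> v} PageRank_i(u)/O(u).

    graph: {page: [pages it links to]}; every page is assumed to have
           at least one out-link and to appear as a key.
    """
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}  # assumed start value
    for _ in range(iterations):
        incoming = {v: 0.0 for v in graph}
        for u, out_links in graph.items():
            share = rank[u] / len(out_links)  # PageRank_i(u) / O(u)
            for v in out_links:
                incoming[v] += share
        rank = {v: (1 - d) / n + d * incoming[v] for v in graph}
    return rank


# Hypothetical example: a -> b, a -> c, b -> c, c -> a
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = simple_pagerank(graph)
```

In a vertex-centric setting the inner loop would correspond to each page sending its share along its out-edges, and the last line to each page summing the messages it received in the next superstep.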
Simple PageRank Example
[Figure: three-vertex example graph, shown step by step]
• Initial vertex values: 1.0, 1.0, 1.0
• Messages sent along the edges: 0.5, 0.5, 1, 1
• Vertex updates: 1 · 0.85 + 0.15/3, 0.5 · 0.85 + 0.15/3, 1.5 · 0.85 + 0.15/3
• Values shown on the final slide: 0.43, 0.21, 0.64
JsonLongDoubleFloatDoubleVertexInputFormat
org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat
Expresses both node and edge information using JSON arrays
[<vertex id>, <vertex value>,
[
[<dest vertex id>, <edge value>],
...
]
]
Notice: for more input/output formats, visit https://github.com/apache/giraph/tree/trunk/giraph-core/src/main/java/org/apache/giraph/io/formats
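A short sketch of writing and reading vertices in the JSON layout shown above, using only Python's standard `json` module. The helper names and the sample vertex are hypothetical, and the sketch assumes one JSON array per vertex per line.

```python
import json


def to_json_vertex_line(vertex_id, vertex_value, edges):
    """Serialize one vertex as [id, value, [[dest id, edge value], ...]].

    edges: list of (dest_vertex_id, edge_value) pairs
    """
    return json.dumps(
        [vertex_id, vertex_value, [[dest, weight] for dest, weight in edges]])


def from_json_vertex_line(line):
    """Parse one line back into (id, value, [(dest id, edge value), ...])."""
    vertex_id, vertex_value, edges = json.loads(line)
    return vertex_id, vertex_value, [(dest, weight) for dest, weight in edges]


# Vertex 0 with value 1.0 and two outgoing edges of weight 1.0.
line = to_json_vertex_line(0, 1.0, [(1, 1.0), (2, 1.0)])
parsed = from_json_vertex_line(line)
```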
Demo
Demo code
https://github.com/manuelcoppotelli/giraph-demo
Q? & A!
Thank you for your attention
Contact us with any questions or problems
Demo code
https://github.com/manuelcoppotelli/giraph-demo
Homework
https://github.com/manuelcoppotelli/giraph-homework