Apache Giraph: Large-scale graph processing done better
Data Mining Class
Sapienza, University of Rome
A. Y. 2016 - 2017
Basic concepts Let’s start Get our hands dirty
Hi!
Simone Santacroce [email protected]
https://it.linkedin.com/in/simone-santacroce-272739134
Manuel Coppotelli [email protected]
https://it.linkedin.com/in/manuelcoppotelli
George Adrian Munteanu [email protected]
https://it.linkedin.com/in/george-adrian-munteanu-707744134
Lorenzo Marconi [email protected]
https://www.linkedin.com/in/lorenzo-marconi-1a2580105
Antonio La Torre [email protected]
https://www.linkedin.com/in/antonio-la-torre-768738134
Lucio Burlini [email protected]
https://www.linkedin.com/in/lucio-burlini-827739134
Apache Giraph
Agenda
1 Basic concepts
• Graphs in the real world
• Challenges on graphs
• MapReduce
• Giraph
2 Let’s start
• Out-Degree & In-Degree
3 Get our hands dirty
• Simple PageRank
Graphs 101
• Graph: a representation of a set of objects, G = <V, E>
• Captures pairwise relationships between objects
• Can have directions, weights, ...
A computer network
A road map
The web
Social networks
• Both physical and Internet-mediated
• Users are vertices
• Any kind of interaction generates edges
Graphs are huge!
∼ 50B pages
∼ 1.1B users
∼ 570M users
∼ 530M users
Graphs are nasty
• Graphs need processing
• Each vertex depends on its neighbors, recursively
• Recursive problems are nicely solved iteratively
So what?
Why not MapReduce?¹
MapReduce is the current standard for managing big sets of data for intensive computing.
Repeat N times ...
¹ https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf
MapReduce Drawbacks
• Each job is executed N times
• Job bootstrap
• Mappers send values and structure
• Extensive I/O at input, shuffle & sort, output
Disk I/O and job scheduling quickly dominate the algorithm
Google’s Pregel²
• Developed specifically for large-scale graph processing
• Intuitive API that lets you “think like a vertex”
• Bulk Synchronous Parallel (BSP) as execution model
• Fault tolerance by checkpointing
² https://www.cs.cmu.edu/~pavlo/courses/fall2013/static/papers/p135-malewicz.pdf
Giraph
The Story
Think like a vertex
• Each vertex has an id, a value, a list of adjacent neighbors and corresponding edge values
• Vertices implement algorithms by sending messages
• Messages are delivered at the start of each superstep
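The vertex-centric model above can be sketched in a few lines of plain Python. This is only an illustration, not Giraph's Java API: the function name `run_supersteps`, the dictionary-based graph, and the choice of algorithm (propagating the maximum vertex value) are all assumptions made for the example. Each loop iteration stands for one superstep, and swapping the outbox into the inbox plays the role of the synchronization barrier.

```python
from collections import defaultdict


def run_supersteps(edges, values, max_supersteps=30):
    """Propagate the largest vertex value through the graph, superstep by superstep.

    edges:  dict mapping a vertex id to the list of its out-neighbors (hypothetical layout)
    values: dict mapping a vertex id to its initial value
    """
    values = dict(values)

    # Superstep 0: every vertex sends its own value to all of its neighbors.
    inbox = defaultdict(list)
    for vertex, neighbors in edges.items():
        for neighbor in neighbors:
            inbox[neighbor].append(values[vertex])

    superstep = 1
    # The computation stops when no messages are in flight.
    while inbox and superstep < max_supersteps:
        outbox = defaultdict(list)
        # Messages sent in the previous superstep are delivered now.
        for vertex, messages in inbox.items():
            best = max(messages)
            if best > values[vertex]:
                # The vertex changed, so it notifies its neighbors.
                values[vertex] = best
                for neighbor in edges.get(vertex, []):
                    outbox[neighbor].append(best)
            # Otherwise the vertex stays silent until it receives a new message.
        inbox = outbox  # global barrier between supersteps
        superstep += 1
    return values, superstep


final_values, _ = run_supersteps({1: [2], 2: [3], 3: [1]}, {1: 3, 2: 6, 3: 1})
print(final_values)
```

Note that a vertex only reads its own state and its incoming messages, which is why no locks are needed inside a superstep.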
Bulk Synchronous Parallel (BSP)
• Master-Slave architecture
• Batch oriented processing
• Computation happens in-memory
Advantages
• No locks: message-based communication
• No semaphores: global synchronization
• Iteration isolation: massively parallelizable
Architecture
Single Map-only Job
Jobs Schema
Other things
Aggregators
• Mechanism for global communication and global computation
• Global value calculated in superstep t available in t + 1
• Pre-defined (e.g. sum, max, min) or user-definable functions³
Combiners
• User-defined function³ applied to messages before they are sent or delivered
• Similar to Hadoop combiners
• Save on network traffic or memory
Checkpointing
• Store work to disk at user-defined intervals (disk isn’t always evil)
• Restart on failure
³ The function has to be both commutative and associative
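A minimal sketch of what a combiner does, written in plain Python rather than against Giraph's Java classes. The helper name `combine_messages` and the `(target, value)` message layout are assumptions for illustration. Because the combining function is commutative and associative, the order in which messages are folded does not change the result, so the framework is free to combine them wherever it is convenient.

```python
def combine_messages(messages, combine):
    """Fold all messages addressed to the same vertex into a single message.

    messages: iterable of (target_vertex_id, value) pairs (hypothetical layout)
    combine:  a commutative and associative function of two values
    """
    combined = {}
    for target, value in messages:
        if target in combined:
            combined[target] = combine(combined[target], value)
        else:
            combined[target] = value
    return combined


# Four outgoing messages shrink to one message per target vertex.
outgoing = [(2, 0.5), (3, 0.5), (3, 1.0), (1, 1.0)]
per_target = combine_messages(outgoing, lambda a, b: a + b)
print(per_target)
```

With a sum combiner, an algorithm that only needs the total of its incoming messages receives one value per vertex instead of one value per edge.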
2 Let’s start
• Out-Degree & In-Degree
LongLongNullTextInputFormat
org.apache.giraph.io.formats.LongLongNullTextInputFormat
If there is an edge from Node 1 to Node 2, then Node 2 appears in the neighbor list of Node 1
<NODE1 ID> <SPACE> <NEIGHBOR1 ID> <SPACE> <NEIGHBOR2 ID> ...
<NODE2 ID> <SPACE> <NEIGHBOR1 ID> <SPACE> <NEIGHBOR2 ID> ...
...
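To make the format concrete, here is a small Python sketch that parses text in this layout and computes the out-degree and in-degree of each vertex. It is a stand-alone illustration, not the Giraph implementation: the function names are hypothetical, and the in-degree is computed with a simple counter, whereas a Giraph job would typically obtain it by having each vertex send a message along its out-edges.

```python
from collections import Counter


def parse_adjacency(text):
    """Parse '<node id> <neighbor id> <neighbor id> ...' lines into a dict."""
    graph = {}
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        graph[int(tokens[0])] = [int(token) for token in tokens[1:]]
    return graph


def out_degrees(graph):
    """Out-degree: length of each vertex's own neighbor list."""
    return {vertex: len(neighbors) for vertex, neighbors in graph.items()}


def in_degrees(graph):
    """In-degree: how many neighbor lists mention each vertex."""
    counts = Counter()
    for neighbors in graph.values():
        counts.update(neighbors)
    return {vertex: counts[vertex] for vertex in graph}


sample = "1 2 3\n2 3\n3 1\n"
graph = parse_adjacency(sample)
print(out_degrees(graph))
print(in_degrees(graph))
```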
IdWithValueTextOutputFormat
org.apache.giraph.io.formats.IdWithValueTextOutputFormat
For each node print the Node ID and the Node Value
<NODE1 ID> <TAB> <NODE1 VALUE>
<NODE2 ID> <TAB> <NODE2 VALUE>
...
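As a rough illustration of this output layout, the following Python helper renders a mapping of vertex ids to values as tab-separated lines. The function name is hypothetical, and sorting by id is an assumption made only to keep the example deterministic; the order of lines in a real job's output is not something this sketch describes.

```python
def format_id_with_value(values):
    """Render one '<node id><TAB><node value>' line per vertex."""
    return "\n".join(
        f"{vertex_id}\t{value}" for vertex_id, value in sorted(values.items())
    )


# Example: out-degrees of a three-vertex graph.
print(format_id_with_value({1: 2, 2: 1, 3: 1}))
```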
Demo
Demo code
https://github.com/manuelcoppotelli/giraph-demo
3 Get our hands dirty
• Simple PageRank
Google’s PageRank⁴
• The success factor of Google’s search engine
• A graph algorithm computing the “importance” of webpages
◦ Important pages have a lot of links from other important pages
◦ Look at the structure of the underlying network
• Ability to conduct web-scale graph processing
⁴ http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf
Simple PageRank
• Recursive definition
$$\mathrm{PageRank}_{i+1}(v) = \frac{1 - d}{N} + d \cdot \sum_{u \to v} \frac{\mathrm{PageRank}_i(u)}{O(u)}$$
• Where:
◦ d: damping factor; the fraction of a page’s PageRank that is transferred to its neighbors. Usually 0.85
◦ N: total number of pages
◦ O(u): out-degree; total number of links within page u
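The recursive definition translates almost directly into an iterative loop. The sketch below is a plain Python illustration under a few assumptions that are not in the slides: the function name `pagerank`, a starting value of 1/N for every page, a fixed number of iterations instead of a convergence test, and a graph in which every page has at least one outgoing link (dangling pages would need extra handling).

```python
def pagerank(graph, d=0.85, iterations=30):
    """Iterate PageRank_{i+1}(v) = (1 - d) / N + d * sum(PageRank_i(u) / O(u)).

    graph: dict mapping each page to the list of pages it links to;
           every page is assumed to have at least one outgoing link.
    """
    n = len(graph)
    ranks = {page: 1.0 / n for page in graph}
    for _ in range(iterations):
        # Sum of PageRank_i(u) / O(u) over all pages u that link to v.
        incoming = {page: 0.0 for page in graph}
        for u, links in graph.items():
            share = ranks[u] / len(links)
            for v in links:
                incoming[v] += share
        ranks = {page: (1 - d) / n + d * incoming[page] for page in graph}
    return ranks


ranks = pagerank({1: [2, 3], 2: [3], 3: [1]})
print(ranks)
```

Starting from values that sum to 1 keeps the total rank at 1 in every iteration, which makes the result easy to sanity-check.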
Simple PageRank Example
Initial vertex values: 1.0, 1.0, 1.0
Messages sent along the edges (value divided by out-degree): 0.5, 0.5, 1, 1
Updated vertex values:
1 · 0.85 + 0.15/3
0.5 · 0.85 + 0.15/3
1.5 · 0.85 + 0.15/3
Vertex values on the final build of the figure: 0.43, 0.21, 0.64
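The first update in the example can be reproduced with a short message-passing sketch in Python. The figure itself is not available in the text, so the edge list below is an assumption: the vertex labels A, B and C are hypothetical, and the edges are chosen only because they are consistent with the messages 0.5, 0.5, 1, 1 and with the three update expressions (one vertex with two out-links, two vertices with one). The sketch covers this single superstep only and makes no claim about how the later values in the figure were obtained.

```python
D = 0.85  # damping factor used in the example

# Assumed structure: A links to B and C, B links to C, C links to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
values = {vertex: 1.0 for vertex in graph}

# Each vertex sends value / out-degree along every outgoing edge.
inbox = {vertex: [] for vertex in graph}
for u, neighbors in graph.items():
    for v in neighbors:
        inbox[v].append(values[u] / len(neighbors))

# Each vertex applies the update rule to the sum of its incoming messages.
n = len(graph)
updated = {vertex: (1 - D) / n + D * sum(inbox[vertex]) for vertex in graph}
print(updated)
```

Under these assumptions A receives 1, B receives 0.5 and C receives 1.5, matching the three expressions in the example.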
JsonLongDoubleFloatDoubleVertexInputFormat
org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat
Express both nodes and edges information using JSON arrays
[<vertex id>, <vertex value>,
[
[<dest vertex id>, <edge value>],
...
]
]
Notice: For more input/output formats, visit https://github.com/apache/giraph/tree/trunk/giraph-core/src/main/java/org/apache/giraph/io/formats
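As an illustration of the JSON layout above, this Python sketch parses one line into a vertex id, a vertex value, and a list of weighted edges. The function name `parse_vertex` and the sample line are made up for the example; the conversions to `int` and `float` simply mirror the Long id and Double/Float values suggested by the class name.

```python
import json


def parse_vertex(line):
    """Parse '[<vertex id>, <vertex value>, [[<dest vertex id>, <edge value>], ...]]'."""
    vertex_id, value, edges = json.loads(line)
    return (
        int(vertex_id),
        float(value),
        [(int(dest), float(weight)) for dest, weight in edges],
    )


# Hypothetical input line: vertex 0 with value 1.0 and two weighted out-edges.
print(parse_vertex("[0, 1.0, [[1, 0.5], [2, 0.5]]]"))
```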
Demo
Demo code
https://github.com/manuelcoppotelli/giraph-demo
Q? & A!
Thank you for your attention
Contact us with any questions or problems
Demo code
https://github.com/manuelcoppotelli/giraph-demo
Homework
https://github.com/manuelcoppotelli/giraph-homework