large scale graph processing

Large Scale Graph Processing Deepankar Patra IIT Madras

Post on 18-Oct-2014








Page 1: Large scale graph processing

Large Scale Graph Processing

Deepankar Patra

IIT Madras

Page 2: Large scale graph processing


Running graph algorithms (e.g. shortest path, connected components, finding the diameter) on huge graphs (a terabyte or more in size).

Page 3: Large scale graph processing


Node/Vertex Edge

Page 4: Large scale graph processing

Example Graph Algorithm

● Shortest Path Algorithm

Source Vertex Destination Vertex
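For reference, here is a minimal single-machine sketch of the shortest path computation, using Dijkstra's algorithm. The slide does not name a specific algorithm, so the choice of Dijkstra, the adjacency-list layout, and the names `Graph`, `kInf` and `ShortestPaths` are assumptions made for illustration. It only works while the whole graph fits in one machine's memory, which is exactly the limit the rest of the deck is about.

```cpp
#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

// Weighted directed graph as an adjacency list:
// graph[u] holds (neighbor, edge weight) pairs. (Illustrative layout.)
using Graph = std::vector<std::vector<std::pair<int, double>>>;

const double kInf = std::numeric_limits<double>::infinity();

// Dijkstra's algorithm: shortest distance from `source` to every vertex.
// Assumes non-negative edge weights. Unreachable vertices keep kInf.
std::vector<double> ShortestPaths(const Graph& graph, int source) {
  std::vector<double> dist(graph.size(), kInf);
  typedef std::pair<double, int> Entry;  // (distance so far, vertex)
  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry> > frontier;

  dist[source] = 0.0;
  frontier.push(Entry(0.0, source));
  while (!frontier.empty()) {
    const Entry top = frontier.top();
    frontier.pop();
    const double d = top.first;
    const int u = top.second;
    if (d > dist[u]) continue;  // stale entry: a shorter path was already found
    for (const auto& edge : graph[u]) {
      const int v = edge.first;
      const double candidate = dist[u] + edge.second;
      if (candidate < dist[v]) {
        dist[v] = candidate;
        frontier.push(Entry(candidate, v));
      }
    }
  }
  return dist;
}
```

Calling `ShortestPaths(graph, source)[destination]` gives the distance from the source vertex to the destination vertex.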

Page 5: Large scale graph processing


Many machine learning algorithms require graph computations, and in the real world their inputs are huge graphs that cannot fit on one machine.

Page 6: Large scale graph processing

Real World? 

Big Graphs:

● Social Networks
● Biological Networks
● Mobile Call Networks
● Citation Networks
● World Wide Web
● Geographic Pathways
● Customer merchant graphs (Amazon, eBay)

Page 7: Large scale graph processing

Facebook Friends Graph


Page 8: Large scale graph processing

Machine Learning Algorithms?

● Recommendation
● PageRank
● Web search
● Cyber security
● Fraud detection
● Clustering
● Shortest Path Calculation

Page 9: Large scale graph processing

Graph Algorithms Typically Involve

● Performing computations at each node based on node features, edge features, and local link structure.

● Propagating computations: “traversing” the graph

Page 10: Large scale graph processing



Page 11: Large scale graph processing

Why not MapReduce?

● Represent graphs as adjacency lists

● Perform local computations in mapper

● Pass along partial results via outlinks, keyed by destination node

● Perform aggregation in reducer on inlinks to a node

● Iterate until convergence: controlled by external “driver”

● Don’t forget to pass the graph structure between iterations
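The MapReduce recipe above can be sketched in-process as follows, using PageRank as the per-node computation. This is a simulation in plain C++, not Hadoop code: `Node`, `PageRankIteration` and `RunPageRank` are hypothetical names, and the sketch assumes every link target also appears as a key in the graph. Note how each iteration has to copy the adjacency lists into its output, which is the overhead the last bullet warns about.

```cpp
#include <cmath>
#include <map>
#include <utility>
#include <vector>

// One record per node: current rank plus its adjacency list.
struct Node {
  double rank;
  std::vector<int> out_links;
};

// One simulated MapReduce iteration of PageRank.
std::map<int, Node> PageRankIteration(const std::map<int, Node>& graph,
                                      double damping) {
  std::map<int, std::vector<double> > shuffled;  // destination -> contributions
  std::map<int, Node> next;

  // "Map": emit rank / out-degree along each outlink, keyed by destination,
  // and re-emit the graph structure so the next iteration still has it.
  for (const auto& kv : graph) {
    const Node& node = kv.second;
    next[kv.first].out_links = node.out_links;
    for (int dst : node.out_links)
      shuffled[dst].push_back(node.rank / node.out_links.size());
  }

  // "Reduce": aggregate the contributions that arrived on each node's inlinks.
  for (auto& kv : next) {
    double sum = 0.0;
    auto it = shuffled.find(kv.first);
    if (it != shuffled.end())
      for (double c : it->second) sum += c;
    kv.second.rank = (1.0 - damping) + damping * sum;
  }
  return next;
}

// External "driver": repeat iterations until the ranks stop changing.
std::map<int, Node> RunPageRank(std::map<int, Node> graph, double damping,
                                double tolerance, int max_iterations) {
  for (int i = 0; i < max_iterations; ++i) {
    std::map<int, Node> next = PageRankIteration(graph, damping);
    double delta = 0.0;
    for (const auto& kv : next)
      delta += std::fabs(kv.second.rank - graph.at(kv.first).rank);
    graph = std::move(next);
    if (delta < tolerance) break;
  }
  return graph;
}
```

In a real MapReduce deployment each call to `PageRankIteration` would be a separate job that reads and writes the full graph from distributed storage.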

Page 12: Large scale graph processing

Why not Spark?

● Spark provides the GraphX library for graph computation.

● But Spark itself is a general-purpose engine; it is not designed specifically for graph algorithms.

● So optimizations that apply only to graphs are not available.

Page 13: Large scale graph processing

PREGEL, Google, 2010

● Basic idea: “think like a vertex”

● Based on the Bulk Synchronous Parallel (BSP) model

● Provides scalability

● Provides fault tolerance

● Provides flexibility to express arbitrary graph algorithms

Page 14: Large scale graph processing

How does it work?

● Master/Worker architecture

    ● Each worker is assigned a subset of a directed graph’s vertices

● Vertex-centric model. Each vertex has:

    ● An arbitrary “value” that can be get/set

    ● A list of messages sent to it

    ● A list of outgoing edges (edges have a value too)

    ● A binary state (active/inactive)
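The per-vertex state listed above could be laid out roughly like this. It is a sketch only: `Edge`, `VertexState`, `VoteToHalt` and `Deliver` are hypothetical names chosen for illustration, not Pregel's actual class definitions.

```cpp
#include <cstdint>
#include <vector>

// An outgoing edge: the target vertex id plus a user-defined edge value.
template <typename EdgeValue>
struct Edge {
  std::int64_t target;
  EdgeValue value;
};

// Everything a worker keeps for one vertex in the vertex-centric model.
template <typename VertexValue, typename EdgeValue, typename MessageValue>
struct VertexState {
  std::int64_t id = 0;
  VertexValue value{};                        // arbitrary value, get/set by user code
  std::vector<Edge<EdgeValue> > out_edges;    // outgoing edges, each with a value
  std::vector<MessageValue> inbox;            // messages sent to this vertex
  bool active = true;                         // binary state: active / inactive
};

// A vertex deactivates itself by voting to halt.
template <typename V, typename E, typename M>
void VoteToHalt(VertexState<V, E, M>& vertex) {
  vertex.active = false;
}

// Delivering a message makes the vertex active again.
template <typename V, typename E, typename M>
void Deliver(VertexState<V, E, M>& vertex, const M& message) {
  vertex.inbox.push_back(message);
  vertex.active = true;
}
```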

Page 15: Large scale graph processing

Graph Partitioning

Worker 1

Worker 3

Worker 2
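One simple way to assign vertices to workers is hash(ID) mod N, which the Pregel paper describes as its default partitioning function. The sketch below uses the vertex ID itself as the hash so the result is easy to follow; `WorkerFor` and `PartitionVertices` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Worker index for a vertex. The hash is assumed to be the identity here;
// a real system would typically apply a proper hash function first.
std::size_t WorkerFor(std::uint64_t vertex_id, std::size_t num_workers) {
  return static_cast<std::size_t>(vertex_id % num_workers);
}

// Groups vertex ids by the worker that owns them.
std::vector<std::vector<std::uint64_t> > PartitionVertices(
    const std::vector<std::uint64_t>& vertex_ids, std::size_t num_workers) {
  std::vector<std::vector<std::uint64_t> > partitions(num_workers);
  for (std::uint64_t id : vertex_ids)
    partitions[WorkerFor(id, num_workers)].push_back(id);
  return partitions;
}
```

This placement ignores the link structure, so neighboring vertices often end up on different workers and their messages must cross the network.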

Page 16: Large scale graph processing

Pregel execution model

The master initiates synchronous iterations (each called a “superstep”). At every superstep:

● Each worker asynchronously executes a user function on all of its vertices

● Each vertex can receive the messages sent to it in the previous superstep

● Vertices can modify their value, modify values of edges, change the topology of the graph (add/remove vertices or edges)

● Vertices can send messages to other vertices to be received in the next superstep

● Vertices can “vote to halt”

● Execution stops when all vertices have voted to halt and no messages are in transit.

● A vote to halt is overridden by a non-empty message queue: an incoming message reactivates the vertex.
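The superstep loop can be illustrated with a single-process simulation. The example algorithm, propagating the maximum value to every vertex, follows the well-known example from the Pregel paper; the names `Vertex` and `RunMaxValue` and the in-memory inbox/outbox vectors are assumptions made for this sketch, and a real system would spread the vertices over many workers.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

struct Vertex {
  int value;
  std::vector<std::size_t> out_edges;  // indices of neighboring vertices
  bool active;
  Vertex(int v, std::vector<std::size_t> edges)
      : value(v), out_edges(std::move(edges)), active(true) {}
};

// Runs supersteps until every vertex has voted to halt and no messages
// remain. Returns the number of supersteps in which some vertex ran.
int RunMaxValue(std::vector<Vertex>& graph) {
  std::vector<std::vector<int> > inbox(graph.size());
  int superstep = 0;
  while (true) {
    std::vector<std::vector<int> > outbox(graph.size());
    bool any_ran = false;
    for (std::size_t v = 0; v < graph.size(); ++v) {
      Vertex& vertex = graph[v];
      if (!inbox[v].empty()) vertex.active = true;  // a message overrides a halt vote
      if (!vertex.active) continue;
      any_ran = true;

      // User function: adopt the largest value seen so far.
      bool changed = (superstep == 0);  // everyone announces its value first
      for (int m : inbox[v]) {
        if (m > vertex.value) {
          vertex.value = m;
          changed = true;
        }
      }
      if (changed) {
        // Messages sent now are received in the next superstep.
        for (std::size_t n : vertex.out_edges) outbox[n].push_back(vertex.value);
      } else {
        vertex.active = false;  // vote to halt
      }
    }
    if (!any_ran) break;         // all halted and no messages in transit
    inbox = std::move(outbox);   // synchronization barrier between supersteps
    ++superstep;
  }
  return superstep;
}
```

After `RunMaxValue` returns, every vertex reachable from the vertex holding the largest value stores that value, and all vertices are inactive.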

Page 17: Large scale graph processing

Pregel Graph Processing

Page 18: Large scale graph processing

Page Rank

PageRank is a link analysis algorithm that is used to determine the importance of a document based on the number of references to it and the importance of the source documents themselves.

Page 19: Large scale graph processing

Page Rank

A = a given page
T1 ... Tn = pages that point to page A (citations)
d = damping factor between 0 and 1 (usually kept at 0.85)
C(T) = number of links going out of T
PR(A) = the PageRank of page A
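The formula these symbols belong to did not survive in the slide text. Assuming the slide showed the standard (non-normalized) PageRank definition, which is the form that matches the `0.15 + 0.85 * sum` update used in the code on the next slide, it reads:

```latex
PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right)
```

With d = 0.85, each page keeps a base score of 0.15 and receives 85% of the rank that its citing pages pass along, where each citing page T splits its rank evenly over its C(T) outgoing links.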

Page 20: Large scale graph processing

Page Rank

class PageRankVertex : public Vertex<double, void, double> {
 public:
  virtual void Compute(MessageIterator* msgs) {
    if (superstep() >= 1) {
      // Sum the rank contributions received from in-neighbors.
      double sum = 0;
      for (; !msgs->done(); msgs->Next())
        sum += msgs->Value();
      *MutableValue() = 0.15 + 0.85 * sum;
    }
    if (superstep() < 30) {
      // Split this vertex's rank evenly among its outgoing edges.
      const int64 n = GetOutEdgeIterator().size();
      SendMessageToAllNeighbors(GetValue() / n);
    } else {
      // Stop after 30 supersteps.
      VoteToHalt();
    }
  }
};


Page 21: Large scale graph processing

Open Source

Pregel was published as a research paper; Google did not release an open source implementation.

As a result, many open source implementations have appeared, and they keep improving on the basic Pregel model. The two most notable are:

a) Apache Giraph, used and heavily developed by Facebook

b) CMU's GraphLab (now a company in its own right)

Page 22: Large scale graph processing

One Example: GraphLab

● GraphLab is currently the best of these

● GraphLab modified the partitioning strategy to reduce the network overhead of message transfer among workers

● GraphLab has a rich library of machine learning algorithms, and it is growing
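The effect of the partitioning strategy on network traffic can be sketched with a toy comparison. The slide does not say which strategy it means; this sketch assumes it is the vertex-cut approach described in the PowerGraph paper, where edges rather than vertices are assigned to workers and a vertex is replicated wherever its edges land. `CountCutEdges` and `CountMirrors` are hypothetical helpers, the placements (id mod N, round-robin edges) are deliberately simplistic, and vertex ids are assumed to be non-negative.

```cpp
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

typedef std::vector<std::pair<int, int> > EdgeList;  // (source, target)

// Edge-cut: vertices are placed by id mod num_workers. An edge is "cut"
// when its endpoints live on different workers, so its messages cross
// the network.
std::size_t CountCutEdges(const EdgeList& edges, int num_workers) {
  std::size_t cut = 0;
  for (const auto& e : edges)
    if (e.first % num_workers != e.second % num_workers) ++cut;
  return cut;
}

// Vertex-cut: edges are placed round-robin, and a vertex is replicated on
// every worker that holds one of its edges. Returns the number of extra
// replicas ("mirrors") that must be kept in sync over the network.
std::size_t CountMirrors(const EdgeList& edges, int num_workers) {
  std::set<std::pair<int, int> > placements;  // (vertex, worker)
  std::set<int> vertices;
  for (std::size_t i = 0; i < edges.size(); ++i) {
    const int worker =
        static_cast<int>(i % static_cast<std::size_t>(num_workers));
    placements.insert(std::make_pair(edges[i].first, worker));
    placements.insert(std::make_pair(edges[i].second, worker));
    vertices.insert(edges[i].first);
    vertices.insert(edges[i].second);
  }
  return placements.size() - vertices.size();
}
```

For a star graph (one hub linked to many leaves), the edge-cut count grows with the number of leaves, while the hub needs at most one mirror per additional worker. High-degree vertices like this are common in the social and web graphs listed earlier.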

Page 23: Large scale graph processing


● Pregel: A System for Large-Scale Graph Processing

● PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs

● GraphX: A Resilient Distributed Graph System on Spark

