
Page 1: PowerGraph

22.06.2015 DIMA – TU Berlin

Fachgebiet Datenbanksysteme und Informationsmanagement Technische Universität Berlin

http://www.dima.tu-berlin.de/

Hot Topics in Information Management PowerGraph: Distributed Graph-Parallel

Computation on Natural Graphs

Igor Shevchenko

Mentor: Sebastian Schelter

Page 2: PowerGraph


Agenda

1. Natural Graphs: Properties and Problems;

2. PowerGraph: Vertex Cut and Vertex Programs;

3. GAS Decomposition;

4. Vertex Cut Partitioning;

5. Delta Caching;

6. Applications and Evaluation;

Paper: Gonzalez et al. PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs.

Page 3: PowerGraph


■ Natural graphs are graphs derived from real-world or natural phenomena;

■ Graphs are big: billions of vertices and edges, plus rich metadata;

Natural graphs have a Power-Law Degree Distribution;

Natural Graphs

Page 4: PowerGraph


Power-Law Degree Distribution

(Figure from Andrei Broder et al., Graph structure in the web.)
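As a short formal aside (added for context; the exponent value is a commonly cited ballpark, not a number from this slide), a power-law degree distribution means the probability that a vertex has degree d obeys

\[
\mathbf{P}(\deg(v) = d) \propto d^{-\alpha}, \qquad \alpha \approx 2,
\]

so most vertices have few neighbors, while a small set of very high-degree vertices covers a large fraction of all edges.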

Page 5: PowerGraph


■ We want to analyze natural graphs;

■ Essential for Data Mining and Machine Learning;

Goal

■ Identify influential people and information;
■ Identify special nodes and communities;
■ Model complex data dependencies;
■ Target ads and products;
■ Find communities;
■ Flow scheduling;

Page 6: PowerGraph


■ Existing distributed graph computation systems perform poorly on natural graphs (Gonzalez et al. OSDI ’12);

■ The reason is the presence of high-degree vertices;

Problem

High Degree Vertices: Star-like motif

Page 7: PowerGraph


Possible problems with high degree vertices:

■ Limited single-machine resources;

■ Work imbalance;

■ Sequential computation;

■ Communication costs;

■ Graph partitioning;

Applicable to:

■ Hadoop; GraphLab; Pregel (Piccolo);

Problem Continued

Page 8: PowerGraph


■ High degree vertices can exceed the memory capacity of a single machine;

■ A single machine must store all their edge meta-data and adjacency information;

Problem: Limited Single-Machine Resources

Page 9: PowerGraph


■ The power-law degree distribution can lead to significant work imbalance and frequent barriers;

■ For example, with synchronous execution (Pregel):

Problem: Work Imbalance

Page 10: PowerGraph


■ No parallelization of individual vertex-programs;

■ Edges are processed sequentially;

■ Locking does not scale well to high-degree vertices (e.g., in GraphLab);

Problem: Sequential Computation

Edges are processed sequentially; asynchronous execution requires heavy locking.

Page 11: PowerGraph


■ High-degree vertices generate and send large amounts of identical messages (e.g., in Pregel);

■ This results in communication asymmetry;

Problem: Communication Costs

Page 12: PowerGraph


■ Natural graphs are difficult to partition;

■ Pregel and GraphLab use random (hashed) partitioning on natural graphs thus maximizing the network communication;

Problem: Graph Partitioning

Page 13: PowerGraph


■ Natural graphs are difficult to partition;

■ Pregel and GraphLab use random (hashed) partitioning on natural graphs thus maximizing the network communication;

With random (hashed) partitioning, the expected fraction of edges cut is 1 − 1/p, where p = number of machines.

Examples:

■ 10 machines: 90% of edges cut;
■ 100 machines: 99% of edges cut;

Problem: Graph Partitioning Continued

Page 14: PowerGraph


■ GraphLab and Pregel are not well suited for computations on natural graphs;

Reasons:

■ Challenges of high-degree vertices;

■ Low quality partitioning;

Solution:

■ PowerGraph: a new abstraction;

In Summary

Page 15: PowerGraph


PowerGraph

Page 16: PowerGraph


Two approaches for partitioning the graph in a distributed environment:

■ Edge Cut;

■ Vertex Cut;

Partition Techniques

Page 17: PowerGraph


■ Used by Pregel and GraphLab abstractions;

■ Evenly assign vertices to machines;

Edge Cut

Page 18: PowerGraph


■ Used by PowerGraph abstraction;

■ Evenly assign edges to machines;

Vertex Cut
The strong point of the paper

(Figure: each machine holds 4 edges.)

Page 19: PowerGraph


Think like a Vertex

[Malewicz et al. SIGMOD’10]

User-defined Vertex-Program:

1. Runs on each vertex;

2. Interactions are constrained by graph structure;

Pregel and GraphLab also use this concept, where parallelism is achieved by running multiple vertex-programs simultaneously;

Vertex Programs

Page 20: PowerGraph


■ Vertex cut distributes a single vertex-program across several machines;

■ This makes it possible to parallelize high-degree vertices;

GAS Decomposition
The strong point of the paper

Page 21: PowerGraph


Generalize the vertex-program into three phases:

1. Gather

Accumulate information about neighborhood;

2. Apply

Apply accumulated value to center vertex;

3. Scatter

Update adjacent edges and vertices;

GAS Decomposition

Gather, Apply and Scatter are user-defined functions;

The strong point of the paper
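To make the three phases concrete, here is a minimal sketch of PageRank expressed as a GAS vertex-program. The Vertex class, the synchronous runner, and the activation threshold are illustrative assumptions, not the actual PowerGraph C++ API.

```python
# Minimal sketch: PageRank as a GAS vertex-program (illustrative).

DAMPING = 0.85

class Vertex:
    def __init__(self, vid):
        self.vid = vid
        self.rank = 1.0
        self.in_neighbors = []   # vertices with an edge into this one
        self.out_neighbors = []  # vertices this one links to

def gather(nbr, center):
    # Gather: runs per in-edge, in parallel; the neighbor's rank share.
    return nbr.rank / max(len(nbr.out_neighbors), 1)

def gather_sum(a, b):
    # User-defined SUM: must be commutative and associative.
    return a + b

def apply(center, acc):
    # Apply: runs once on the center vertex with the total accumulator.
    center.rank = (1 - DAMPING) + DAMPING * acc

def gas_round(vertices, active):
    # One synchronous Gather-Apply-Scatter round over active vertices.
    next_active = set()
    for vid in active:
        v = vertices[vid]
        acc = 0.0
        for nbr in v.in_neighbors:                 # Gather
            acc = gather_sum(acc, gather(nbr, v))
        old = v.rank
        apply(v, acc)                              # Apply
        if abs(v.rank - old) > 1e-4:               # Scatter: reactivate
            next_active.update(n.vid for n in v.out_neighbors)
    return next_active
```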

Page 22: PowerGraph


■ Executed on the edges in parallel;

■ Accumulate information about neighborhood;

Gather Phase

Page 23: PowerGraph


■ Executed on the central vertex;

■ Apply accumulated value to center vertex;

Apply Phase

Page 24: PowerGraph


■ Executed on the neighboring vertices in parallel;

■ Update adjacent edges and vertices;

Scatter Phase

Page 25: PowerGraph


■ Vertex-programs that are written using GAS decomposition will automatically scale to several machines;

How does it work?

GAS Decomposition

Page 26: PowerGraph


GAS in a Distributed Environment

Page 27: PowerGraph


■ Case with 2 machines;

GAS in a Distributed Environment

Page 28: PowerGraph


■ Compute partial sums on each machine;

Gather Phase

Page 29: PowerGraph


■ Send partial sum to the master machine;

■ Master machine computes the total sum;

Gather Phase
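To make the picture concrete, a minimal sketch of this step (the data layout and function names are assumptions for illustration):

```python
# Sketch: distributed Gather with partial sums (illustrative).
def local_gather(local_edge_values):
    # Each machine pre-aggregates over ONLY the edges it stores.
    acc = 0.0
    for val in local_edge_values:
        acc += val  # the user-defined commutative/associative SUM
    return acc

def master_gather(per_machine_edge_values):
    # Every mirror sends a single partial sum; the master combines them,
    # so traffic is O(#machines spanned), not O(vertex degree).
    return sum(local_gather(vals) for vals in per_machine_edge_values)

# Example: a degree-6 vertex whose edges are split across 2 machines.
print(master_gather([[0.2, 0.1, 0.3], [0.4, 0.05, 0.15]]))  # ~1.2
```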

Page 30: PowerGraph


■ Apply accumulated value to center vertex;

■ Replicate value to the mirrors;

Apply Phase

Page 31: PowerGraph


■ Update adjacent edges and vertices;

■ Initiate neighboring vertex-programs if necessary;

Scatter Phase

Page 32: PowerGraph


■ During the Gather Phase the partial results are combined using a commutative and associative user-defined SUM operation;

■ Examples:

sum(a, b): return a + b

sum(a, b): return union(a, b)

sum(a, b): return min(a, b)

■ Also a requirement for Pregel combiners;

■ What if not commutative and associative?

SUM Operation

Page 33: PowerGraph


■ If the SUM is not commutative and associative;

■ Each edge's data is sent to the master machine;

■ This increases the communication volume during Gather:

Gather Phase: no partial sums

Page 34: PowerGraph


Vertex Cut Partitioning

The strong point of the paper

Page 35: PowerGraph


Three distributed approaches for Vertex Cut:

■ Random Edge Placement;

■ Coordinated Greedy Edge Placement;

■ Oblivious Greedy Edge Placement;

Vertex Cut Partitioning

Minimizing the number of machines spanned by each vertex = minimizing communication and storage overhead.

Page 36: PowerGraph


■ Randomly assign edges to machines;

Random Edge Placement

Page 37: PowerGraph


Random Edge Placement

■ Randomly assign edges to machines;

Page 38: PowerGraph


Random Edge Placement

■ Randomly assign edges to machines;

■ Edge data is uniquely assigned to one machine;
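A minimal sketch of such an assignment (the hash choice is an illustrative assumption, not the paper's exact scheme):

```python
# Sketch: random (hashed) edge placement for a vertex cut.
def place_edge(src, dst, num_machines):
    # Each edge lands on exactly one machine; a vertex whose edges land
    # on several machines is replicated there as a mirror.
    return hash((src, dst)) % num_machines

print(place_edge(1, 2, 10))  # deterministic for integer vertex ids
```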

Page 39: PowerGraph


■ Only 3 network communication channels;

■ Can predict network communication usage;

■ Significantly less communication compared to Edge Cut graph placement;

■ Can improve upon random placement!

Communication Overhead
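The predictability comes from the paper's analysis of random edge placement; reconstructed here in LaTeX (notation: A(v) is the set of machines spanned by vertex v, D[v] its degree, p the number of machines):

\[
\mathbb{E}\left[\frac{1}{|V|}\sum_{v \in V}\lvert A(v)\rvert\right]
= \frac{p}{|V|}\sum_{v \in V}\left(1-\left(1-\frac{1}{p}\right)^{D[v]}\right)
\]

For power-law graphs this expected replication factor stays low, because most vertices have small degree and therefore span few machines.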

Page 40: PowerGraph


■ Place edges on machines which already hold the vertices of that edge;

Greedy Edge Placement

Page 41: PowerGraph


■ If several choices are possible, assign to the least loaded machine;

Greedy Edge Placement

Page 42: PowerGraph


■ Greedy Edge Placement is a de-randomization of random placement;

■ Minimizes the number of machines spanned;

Coordinated Greedy Edge Placement:

■ Requires coordination to place each edge;

■ Maintains global distributed placement table;

■ Slower but produces higher quality cuts;

Oblivious Greedy Edge Placement:

■ Approx. greedy objective without coordination;

■ Faster but produces lower quality cuts;

Greedy Edge Placement
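The case analysis behind greedy placement can be sketched as follows (paraphrased from the paper; the data structures and the tie-breaking in Case 2 are simplified assumptions):

```python
from collections import defaultdict

def greedy_place(u, v, assigned, load, machines):
    # assigned[x]: set of machines already holding edges of vertex x.
    a_u, a_v = assigned[u], assigned[v]
    if a_u & a_v:                 # Case 1: overlap -> use the intersection.
        candidates = a_u & a_v
    elif a_u and a_v:             # Case 2: both placed, disjoint -> machines
        candidates = a_u | a_v    # of either vertex (the paper prefers the
                                  # vertex with more unplaced edges).
    elif a_u or a_v:              # Case 3: only one placed -> its machines.
        candidates = a_u or a_v
    else:                         # Case 4: neither placed -> any machine.
        candidates = set(machines)
    m = min(candidates, key=lambda x: load[x])  # break ties: least loaded
    assigned[u].add(m)
    assigned[v].add(m)
    load[m] += 1
    return m

# Usage: place a few edges on 3 machines.
assigned, load = defaultdict(set), defaultdict(int)
for edge in [(1, 2), (2, 3), (1, 3)]:
    print(edge, "->", greedy_place(*edge, assigned, load, range(3)))
```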

Page 43: PowerGraph


■ Twitter Follower Graph: 41M vertices, 1.4B edges;

■ Oblivious Greedy Edge Placement balances cost (replication factor) and construction time;

Vertex Cut Partitioning: Comparison

Page 44: PowerGraph


■ Greedy Edge Placement improves computation performance;

Vertex Cut Partitioning: Comparison

Page 45: PowerGraph


Delta Caching

Execution Modes

Page 46: PowerGraph


■ A vertex-program can be triggered by a change in only a few of its neighbors;

■ Even so, the Gather Phase will accumulate information over the whole neighborhood;

Delta Caching
The strong point of the paper

Page 47: PowerGraph


■ Accelerate this by caching the neighborhood accumulators from the previous Gather Phase;

Delta Caching
The strong point of the paper

Page 48: PowerGraph


Delta Caching can speed up:

■ Gather Phase;
■ Scatter Phase;

Requires an Abelian group (commutative and associative):

■ sum (+);
■ inverse (−);

Examples:

■ PageRank – applicable;
■ Graph Coloring – not applicable;

Delta Caching
The strong point of the paper
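A minimal sketch of the mechanism, assuming PageRank-style numeric accumulators (the function names are illustrative):

```python
# Sketch: delta caching (illustrative; assumes the SUM has an inverse).
cache = {}  # vertex id -> cached gather accumulator

def gather_with_cache(vid, full_gather):
    # If a cached accumulator exists, skip the full Gather entirely.
    if vid not in cache:
        cache[vid] = full_gather(vid)
    return cache[vid]

def scatter_delta(dst_vid, old_contrib, new_contrib):
    # On Scatter, push only the change into the neighbor's cache:
    # a_dst <- a_dst + (new - old). This needs the inverse (-) to exist,
    # which is why Graph Coloring-style accumulators do not qualify.
    if dst_vid in cache:
        cache[dst_vid] += new_contrib - old_contrib
```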

Page 49: PowerGraph


Supports three execution modes:

■ Synchronous: Bulk-Synchronous GAS Phases;

■ Asynchronous: Interleave GAS Phases;

■ Asynchronous Serializable: Prevents neighboring vertices from running simultaneously;

Different tradeoffs:

■ Algorithm performance;

■ System performance;

■ Determinism;

Execution Modes

Page 50: PowerGraph


Evaluation

Page 51: PowerGraph


PowerGraph on natural graphs shows, on many examples:

■ Reduced network communication;

■ Reduced runtime;

■ Reduced storage;

Evaluation

PageRank on the Twitter Follower Graph (40M Users, 1.4 Billion Links)

Page 52: PowerGraph


■ Collaborative Filtering

Alternating Least Squares

Stochastic Gradient Descent

SVD

Non-negative MF

■ Statistical Inference

Loopy Belief Propagation

Max-Product Linear Programs

Gibbs Sampling

Applicability

■ Graph Analytics

PageRank

Triangle Counting

Shortest Path

Graph Coloring

K-core Decomposition

■ Computer Vision

Image stitching

■ Language Modeling

LDA

Page 53: PowerGraph


■ Vertex Cut;

■ GAS Decomposition;

■ Delta Caching;

■ Three modes of execution;

Synchronous;

Asynchronous;

Asynchronous + Serializable;

Strong Points of the Paper

Page 54: PowerGraph


■ “In all cases the system is entirely symmetric with no single coordinating instance or scheduler”;

How do they deal with Synchronous execution?

Evaluation mess:

■ Evaluated Synchronous execution using PageRank;

■ Evaluated Asynchronous execution using Graph Coloring;

■ Evaluated Asynchronous Serializable execution using Graph Coloring;

■ Compared PowerGraph against published results, again using PageRank and Triangle Counting, but not Graph Coloring;

■ Oblivious Greedy Edge Placement is poorly explained;

Weak Points of the Paper

Page 55: PowerGraph


■ Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, Carlos Guestrin. PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs. 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2012);

■ Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J., Horn, I., Leiser, N., and Czajkowski, G. Pregel: a system for large-scale graph processing. In SIGMOD (2010).

■ Low, Y., Gonzalez, J., Kyrola, A., Bickson, D., Guestrin, C., and Hellerstein, J. M. Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud. in PVLDB (2012).

■ http://graphlab.org

References

Page 56: PowerGraph


Questions?

1. Natural Graphs: Properties and Problems;

2. PowerGraph: Vertex Cut and Vertex Programs;

3. GAS Decomposition;

4. Vertex Cut Partitioning;

5. Delta Caching;

6. Applications and Evaluation;

Paper: Gonzalez et al. PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs.