CS535 Big Data 4/7/2020 Week 11-A Sangmi Lee Pallickara

http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University, page 1

CS535 BIG DATA

PART B. GEAR SESSIONS
SESSION 3: BIG GRAPH ANALYSIS

Sangmi Lee Pallickara
Computer Science, Colorado State University
http://www.cs.colostate.edu/~cs535

CS535 Big Data | Computer Science | Colorado State University

FAQs
• Online GEAR presentation will be available on 4/6
• You will have a 3-day discussion period on Piazza
  • 4/6 ~ 4/8


Topics of Today’s Class
• GraphX: Graph Processing in a Distributed Dataflow Framework
  • Part 1: Introduction and Graph parallelism
  • Part 2: Distributed Graph Representation
  • Part 3: Implementation of Distributed Graph Processing


GEAR Session 3. Big Graph Analysis
Lecture 2. Distributed Large Graph Analysis-II

GraphX: Graph Processing in a Distributed Dataflow Framework


This material is built based on
• Gonzalez, J.E., Xin, R.S., Dave, A., Crankshaw, D., Franklin, M.J. and Stoica, I., 2014. GraphX: Graph processing in a distributed dataflow framework. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14) (pp. 599-613).
• Karypis, G. and Kumar, V., 1998. Multilevel k-way partitioning scheme for irregular graphs. J. Parallel Distrib. Comput. 48, 1, 96–129.
• GraphX Programming Guide https://spark.apache.org/docs/latest/graphx-programming-guide.html


Introduction
• GraphX is a library built on top of Apache Spark for graphs and graph-parallel computation
• Introduces a Graph abstraction
  • Directed multigraph with properties attached to each vertex and edge
• Provides a set of graph operators
  • E.g. subgraph, joinVertices, and aggregateMessages
• Provides an optimized variant of the Pregel API
• Implements graph algorithms and builders
  • PageRank
  • Connected Components
  • Triangle Counting


Computational Challenges
• Specialized graph processing systems outperform general-purpose distributed dataflow frameworks by using their own specialized optimization schemes
  • E.g. Pregel, PowerGraph, BLAS, Kineograph
• Graphs are often only one part of a larger analytics process
  • Combines graphs with unstructured and tabular data
  • Analytics pipelines are forced to compose multiple systems
  • Extra data movement and duplication
  • Fault tolerance
• A design for graph processing systems on top of general-purpose distributed dataflow systems is needed


Distributed Dataflow Model and Optimization Schemes for Graph Processing


Dataflow Models - Traditional Network Programming
• Message-passing between nodes (e.g. MPI)
• Very difficult to do at scale
  • How to split the problem across nodes?
    • Network communication & data locality
  • How to deal with failures? (inevitable at scale)
  • Stragglers?
    • Node not failed but slow
  • Writing programs for each machine
• Rarely used in commodity datacenters!


Dataflow Models – Modern distributed dataflow models
• Restrict the programming interface
  • System can do more automatically
• Express jobs as graphs of high-level operators
  • System picks how to split each operator into tasks and where to run each task
  • Run parts multiple times for fault recovery
  • Examples: MapReduce, Spark, Dryad, Storm, Pig, Hive…
• Examples of dataflow operators
  • join, map, groupBy, … most of the operators introduced in the Apache Spark discussion


Why did these graph processing systems evolve separately from distributed dataflow frameworks?
• Early emphasis on single-stage computation and on-disk processing
  • Limited capability to handle iterative graph algorithms
  • Repeatedly and randomly access subsets of the graph
  • E.g. MapReduce
• Early distributed dataflow frameworks did not support fine-grained control over data partitioning
  • Recent frameworks (e.g. Spark and Naiad) support in-memory representation and fine-grained control over data partitioning


Optimizations used in GraphX
• Encoding graphs as collections
  • Vertex-cut partitioning
• Executing graph algorithms as common dataflow operators
• Join optimizations
  • E.g. CSR indexing, join elimination and join-site specification
• Materialized view maintenance
  • Vertex mirroring and delta updates
• Applying the above techniques, GraphX provides a new set of Spark dataflow operators for graph processing
• Reducing memory overhead and improving system performance
  • Immutability: GraphX reuses indices across graph and collection views over multiple iterations


The Property Graphs as Collections and Executing Graph Algorithms


Property Graph
• User-defined properties attached to each vertex and edge
  • Metadata
    • E.g. user profiles and time stamps
  • Program state
    • E.g. the PageRank of vertices or inferred affinities
• Applicable to natural phenomena such as social networks and web graphs
  • Often highly skewed
  • Power-law degree distributions
  • Orders of magnitude more edges than vertices


Transforming a Property Graph to a Pair of Collections

• Vertex collection
  • Vertex properties (with a unique key: the vertex identifier)
  • Vertex identifiers are 64-bit integers
  • Derived externally (e.g. using a userID) or by applying a hash function to a vertex property (e.g. a URL)
• Edge collection
  • Edge properties (with source and destination vertex identifiers)
• Having a pair of collections enables the system to express graph algorithms with existing dataflow operations
  • Join: adding additional vertex properties
  • Creating new collections: creating a new graph
    • E.g. maintaining a graph for PageRanks and another graph for membership information while sharing the same edge collection
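The pair-of-collections encoding can be sketched with plain Scala collections. This is only an illustration, not the GraphX API: the object name, the sample data, and the helper idFromUrl are assumptions made for the example.

```scala
object PropertyGraphAsCollections {
  type Id = Long // vertex identifiers are 64-bit integers

  // Edge collection: (source id, destination id, edge property)
  val edges: Seq[(Id, Id, Double)] = Seq((1L, 2L, 1.0), (2L, 3L, 1.0), (3L, 1L, 1.0))

  // Two vertex collections keyed by the same identifiers; together with the
  // shared edge collection they form two logical graphs
  val ranks: Map[Id, Double] = Map(1L -> 0.9, 2L -> 1.1, 3L -> 1.0)
  val groups: Map[Id, String] = Map(1L -> "a", 2L -> "a", 3L -> "b")

  // Hypothetical helper: derive an identifier by hashing a vertex property such as a URL
  def idFromUrl(url: String): Id = url.hashCode.toLong

  // Join: attach an additional property to every vertex that has one
  val joined: Map[Id, (Double, String)] =
    ranks.collect { case (id, r) if groups.contains(id) => (id, (r, groups(id))) }
}
```

Both ranks and groups reuse the same edges value, which mirrors the idea of keeping a PageRank graph and a membership graph over one edge collection.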


The Graph-Parallel Abstraction (Discussed in W10-A)
• Iterative local transformations
  • E.g. PageRank algorithm
• Vertex program
  • The system launches the vertex program for each vertex; it interacts with adjacent vertex programs through messages (e.g. Pregel) or shared state (e.g. PowerGraph)
• Example with the PageRank algorithm

PageRank in Pregel

def PageRank(v: Id, msgs: List[Double]) {
  // Compute the message sum
  var msgSum = 0.0
  for (m <- msgs) { msgSum += m }
  // Update the PageRank
  PR(v) = 0.15 + 0.85 * msgSum
  // Broadcast messages with new PR
  for (j <- OutNbrs(v)) {
    msg = PR(v) / NumLinks(v)
    send_msg(to=j, msg)
  }
  // Check for termination
  if (converged(PR(v))) voteToHalt(v)
}


The Graph-Parallel Abstraction (Discussed in W10-A)
• Advantage
  • Well-suited for iterative graph algorithms that operate on the static neighborhood structure of the graph
• Disadvantages
  • It cannot express computation in which disconnected vertices interact
  • It cannot express computation that changes the graph structure in the course of the computation


The GAS Decomposition

• Gonzalez et al. [1] observed that most vertex programs interact with neighboring vertices by collecting messages in the form of a generalized commutative associative sum and then broadcasting new messages in an inherently parallel loop

[1] Gonzalez, J. E., Low, Y., Gu, H., Bickson, D., and Guestrin, C. “PowerGraph: Distributed graph-parallel computation on natural graphs,” OSDI’12, USENIX Association, pp. 17–30.


Types of graph computation [1/3]
• Gather: the computation gathers information from neighboring vertices
  • E.g. authority value of the HITS algorithm
  • E.g. current PageRank value


Types of graph computation [2/3]
• Apply: the vertex applies an update to the vertex property
  • E.g. update the authority value with the normalized sum of the new authority values
  • E.g. add the received PageRank values, normalize the sum, and update the current PageRank value


Types of graph computation [3/3]
• Scatter: the vertex sends out information to neighboring vertices


The GAS Decomposition
• The GAS decomposition
  • Splits vertex programs into three data-parallel stages
    • Gather
    • Apply
    • Scatter

def PageRank(v: Id, msgs: List[Double]) {
  // Gather: compute the message sum
  var msgSum = 0.0
  for (m <- msgs) { msgSum += m }
  // Apply: update the PageRank
  PR(v) = 0.15 + 0.85 * msgSum
  // Scatter: broadcast messages with new PR
  for (j <- OutNbrs(v)) {
    msg = PR(v) / NumLinks(v)
    send_msg(to=j, msg)
  }
  // Check for termination
  if (converged(PR(v))) voteToHalt(v)
}


The GAS Decomposition
• Pull-based model of message computation
  • The system asks the vertex program for the value of the message between adjacent vertices
  • Rather than the user sending messages directly from the vertex program
  • Therefore, vertex-cut partitioning is suitable for this style of computation
• Limited communication pattern
  • Supports communication only between adjacent vertices
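A minimal single-machine sketch of PageRank split into gather, apply, and scatter functions may make the pull-based style concrete. The toy edge list, the object name, and the step helper are assumptions for illustration. The point is that the "system" (here, step) calls scatter once per edge instead of the vertex program sending messages itself.

```scala
object GasPageRank {
  type Id = Long

  // Hypothetical toy graph: directed edges (src, dst)
  val edges: Seq[(Id, Id)] = Seq((1L, 2L), (1L, 3L), (2L, 3L), (3L, 1L))
  val outDegree: Map[Id, Int] = edges.groupBy(_._1).map { case (v, es) => (v, es.size) }

  // Gather: commutative, associative sum of the messages arriving at a vertex
  def gather(a: Double, b: Double): Double = a + b

  // Apply: compute the new vertex property from the gathered sum
  def applyUpdate(msgSum: Double): Double = 0.15 + 0.85 * msgSum

  // Scatter: the value of the message carried by one out-edge of src
  def scatter(src: Id, rank: Double): Double = rank / outDegree(src)

  // One iteration over the whole graph, driven by the system rather than the vertices
  def step(ranks: Map[Id, Double]): Map[Id, Double] = {
    val msgs: Seq[(Id, Double)] = edges.map { case (s, d) => (d, scatter(s, ranks(s))) }
    val sums: Map[Id, Double] =
      msgs.groupBy(_._1).map { case (v, ms) => (v, ms.map(_._2).reduce(gather)) }
    ranks.map { case (v, _) => (v, applyUpdate(sums.getOrElse(v, 0.0))) }
  }
}
```

Because gather is commutative and associative, the partial sums for one vertex can be computed in any order and on any partition that holds some of its edges.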


Graph Computation as Dataflow Ops.
• Graph-parallel computation can be expressed as a sequence of join stages, group-by stages, and map operations
• Join stage
  • Vertex and edge properties are joined to form the triplets view
  • Each triplet consists of an edge and its corresponding source and destination vertex properties
• Group-by stage
  • The triplets are grouped by source or destination vertex to construct the neighborhood of each vertex and compute aggregates
  • Gathers messages destined for the same vertex
• Map operation
  • Applies the aggregated message results for the given vertex to update the vertex property
• Join operation
  • Distributes the updated values to the vertices


Discussions
• Assume that you implement the PageRank algorithm using three stages in GraphX. Which stage will be applied for line 5?
  a. Join stage
  b. GroupBy stage
  c. Map operations
  d. All of the above

 0: def PageRank(v: Id, msgs: List[Double]) {
 1:   // Compute the message sum
 2:   var msgSum = 0.0
 3:   for (m <- msgs) { msgSum += m }
 4:   // Update the PageRank
 5:   PR(v) = 0.15 + 0.85 * msgSum
 6:   // Broadcast messages with new PR
 7:   for (j <- OutNbrs(v)) {
 8:     msg = PR(v) / NumLinks(v)
 9:     send_msg(to=j, msg)
10:   }
11:   // Check for termination
12:   if (converged(PR(v))) voteToHalt(v)
13: }


Discussions
• Assume that you implement the PageRank algorithm using three stages in GraphX. Which stage will be applied for line 3?
  a. Join stage
  b. GroupBy stage
  c. Map operations
  d. All of the above

 0: def PageRank(v: Id, msgs: List[Double]) {
 1:   // Compute the message sum
 2:   var msgSum = 0.0
 3:   for (m <- msgs) { msgSum += m }
 4:   // Update the PageRank
 5:   PR(v) = 0.15 + 0.85 * msgSum
 6:   // Broadcast messages with new PR
 7:   for (j <- OutNbrs(v)) {
 8:     msg = PR(v) / NumLinks(v)
 9:     send_msg(to=j, msg)
10:   }
11:   // Check for termination
12:   if (converged(PR(v))) voteToHalt(v)
13: }


The GAS Decomposition with GraphX
• Gather
  • GroupBy stage
• Apply
  • Map operation
• Scatter
  • Join stage


Triplets view
• Each edge and its corresponding source and destination vertex properties

[Figure: the Vertices and Edges collections for vertices A and B are joined to form the Triplets collection]

Constructing Triplets in SQL

CREATE VIEW triplets AS
SELECT s.Id, d.Id, s.P, e.P, d.P
FROM edges AS e
JOIN vertices AS s JOIN vertices AS d
ON e.srcId = s.Id AND e.dstId = d.Id
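The same three-way join can be sketched over plain Scala collections. The Triplet case class and its field names are assumptions for this example and are not the GraphX types. Edges that point to a missing vertex are dropped, as in an inner join.

```scala
object TripletsView {
  type Id = Long

  // Illustrative triplet record; the field names are assumptions
  final case class Triplet[V, E](srcId: Id, dstId: Id, srcP: V, edgeP: E, dstP: V)

  def triplets[V, E](vertices: Map[Id, V], edges: Seq[(Id, Id, E)]): Seq[Triplet[V, E]] =
    for {
      (s, d, e) <- edges
      sp <- vertices.get(s).toList // inner join on e.srcId = s.Id
      dp <- vertices.get(d).toList // inner join on e.dstId = d.Id
    } yield Triplet(s, d, sp, e, dp)
}
```

For example, with vertices 1 -> "A" and 2 -> "B" and a single edge (1, 2, 0.5), the view contains one triplet that carries "A", 0.5, and "B" together.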


GraphX Graph Operators
• Transform vertex and edge collections
• Graph constructor
  • Logically binds a pair of vertex and edge property collections into a property graph
  • Verifies integrity constraints: every vertex occurs only once and edges do not link to missing vertices
  • def Graph(v: Collection[(Id, V)], e: Collection[(Id, Id, E)])
• Collection views
  • The vertices and edges operators expose the graph’s vertex and edge property collections
  • The triplets operator returns the triplets view of the graph
  • def vertices: Collection[(Id, V)]
  • def edges: Collection[(Id, Id, E)]
  • def triplets: Collection[Triplet]


GraphX Graph Operators
• Graph-parallel computation
  • The MapReduce Triplets (mrTriplets) operator encodes the two-stage process of graph-parallel computation
  • Composes the map and group-by dataflow operators on the triplets view
  • A user-defined map function is applied to each triplet
  • Generates values and aggregates them at the destination vertex using a user-defined binary aggregation function
  • def mrTriplets(f: (Triplet) => M, sum: (M, M) => M): Collection[(Id, M)]
  • In SQL:

SELECT t.dstId, reduce(mapF(t)) AS msgSum
FROM triplets AS t GROUP BY t.dstId
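As a rough illustration of the map followed by group-by composition, the operator can be written over an in-memory sequence of triplets. The Triplet case class, its field names, and the Map result type are assumptions made for this sketch. The real operator works on distributed collections.

```scala
object MrTripletsSketch {
  type Id = Long

  // Illustrative triplet record; the field names are assumptions
  final case class Triplet[V, E](srcId: Id, dstId: Id, src: V, attr: E, dst: V)

  // Map every triplet to a message, then group by destination vertex and reduce
  def mrTriplets[V, E, M](triplets: Seq[Triplet[V, E]],
                          mapF: Triplet[V, E] => M,
                          sum: (M, M) => M): Map[Id, M] =
    triplets
      .map(t => (t.dstId, mapF(t)))
      .groupBy(_._1)
      .map { case (id, msgs) => (id, msgs.map(_._2).reduce(sum)) }
}
```

Passing a map function that always returns 1 and addition as the aggregation yields the in-degree of every vertex that has at least one incoming edge.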


GraphX Graph Operators
• Convenience functions
  • def mapV(f: (Id, V) => V): Graph[V, E]
  • def mapE(f: (Id, Id, E) => E): Graph[V, E]
  • def leftJoinV(v: Collection[(Id, V)], f: (Id, V, V) => V): Graph[V, E]
  • def leftJoinE(e: Collection[(Id, Id, E)], f: (Id, Id, E, E) => E): Graph[V, E]
  • def subgraph(vPred: (Id, V) => Boolean, ePred: (Triplet) => Boolean): Graph[V, E]
  • def reverse: Graph[V, E]


Example use of mrTriplets
• Compute the number of older followers for each user in a social network

[Figure: social graph with vertices A (42), B (23), C (30), D (19), E (75), F (16). For the triplet from A to B, the source property is 42 and the target property is 23, so mapF returns 1 as the message to vertex B.]

Resulting vertices
Vertex id | Property
A | 0
B | 2
C | ?
D | ?
E | ?
F | ?

val graph: Graph[User, Double]
def mapUDF(t: Triplet[User, Double]) = ??? // What will be your computation here?
def reduceUDF(a: Int, b: Int): Int = a + b
val seniors: Collection[(Id, Int)] = graph.mrTriplets(mapUDF, reduceUDF)


Example use of mrTriplets
• Compute the number of older followers for each user in a social network

[Figure: social graph with vertices A (42), B (23), C (30), D (19), E (75), F (16). For the triplet from A to B, the source property is 42 and the target property is 23, so mapF returns 1 as the message to vertex B.]

Resulting vertices
Vertex id | Property
A | 0
B | 2
C | 1
D | 1
E | 0
F | 3

val graph: Graph[User, Double]
def mapUDF(t: Triplet[User, Double]) = if (t.src.age > t.dst.age) 1 else 0
def reduceUDF(a: Int, b: Int): Int = a + b
val seniors: Collection[(Id, Int)] = graph.mrTriplets(mapUDF, reduceUDF)
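The older-followers computation can be tried out on plain collections. Two things here are assumptions: the pairing of ages with vertices is read from the figure, and the follows edge list is invented so that the counts agree with the result table (the real edges of the figure are not recoverable from the text). String labels stand in for 64-bit vertex ids to keep the example readable.

```scala
object OlderFollowers {
  type Id = String
  final case class User(age: Int)

  // Ages as read from the figure (assumed pairing of labels and values)
  val users: Map[Id, User] = Map(
    "A" -> User(42), "B" -> User(23), "C" -> User(30),
    "D" -> User(19), "E" -> User(75), "F" -> User(16))

  // ASSUMPTION: (follower, followee) edges invented to match the result table
  val follows: Seq[(Id, Id)] = Seq(
    ("A", "B"), ("C", "B"), ("A", "C"), ("E", "D"),
    ("C", "F"), ("D", "F"), ("E", "F"), ("B", "A"), ("D", "E"))

  // Message is 1 when the follower (source) is older than the followee (destination)
  def mapUDF(src: User, dst: User): Int = if (src.age > dst.age) 1 else 0
  def reduceUDF(a: Int, b: Int): Int = a + b

  // Map over the edges, then group by destination and sum the messages
  val seniors: Map[Id, Int] =
    follows
      .map { case (s, d) => (d, mapUDF(users(s), users(d))) }
      .groupBy(_._1)
      .map { case (id, ms) => (id, ms.map(_._2).reduce(reduceUDF)) }
}
```

Note that a vertex with no incoming edge would receive no message and would be absent from the result. The assumed edge list gives every vertex at least one follower so that A and E appear with a count of 0.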


Implementation of the Pregel abstraction using GraphX
• Initializes the vertex properties with an additional field to track active vertices
• While vertices are active, messages are computed using the mrTriplets operator
  • Edge-parallel map operation
    • Message computation
  • Commutative, associative aggregation

def Pregel(g: Graph[V, E],
           vprog: (Id, V, M) => V,
           sendMsg: (Triplet) => M,
           gather: (M, M) => M): Collection[V] = {
  // Set all vertices as active
  g = g.mapV((id, v) => (v, halt=false))
  // Loop until convergence
  while (g.vertices.exists(v => !v.halt)) {
    // Compute the messages
    val msgs: Collection[(Id, M)] =
      // Restrict to edges with active source
      g.subgraph(ePred=(s,d,sP,eP,dP)=>!sP.halt)
      // Compute messages
      .mrTriplets(sendMsg, gather)
    // Receive messages and run vertex program
    g = g.leftJoinV(msgs).mapV(vprog)
  }
  return g.vertices
}
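A runnable, single-machine approximation of this loop is sketched below. It is not the GraphX implementation. The VState wrapper, the Option-typed messages, the maxIters guard, and the rule "a vertex halts when its value did not change" are all simplifying assumptions chosen for the example.

```scala
object MiniPregel {
  type Id = Long

  // Vertex value plus the extra "halt" field that tracks active vertices
  final case class VState[V](value: V, halt: Boolean)

  def run[V, M](vertices: Map[Id, V], edges: Seq[(Id, Id)],
                vprog: (Id, V, Option[M]) => V,
                sendMsg: (V, V) => Option[M], // (source value, destination value)
                gather: (M, M) => M,
                maxIters: Int = 50): Map[Id, V] = {
    // Set all vertices as active
    var g: Map[Id, VState[V]] = vertices.map { case (id, v) => (id, VState(v, halt = false)) }
    var iter = 0
    while (g.values.exists(!_.halt) && iter < maxIters) {
      // Edge-parallel map, restricted to edges whose source is still active
      val msgs: Seq[(Id, M)] = for {
        (s, d) <- edges
        if !g(s).halt
        m <- sendMsg(g(s).value, g(d).value).toList
      } yield (d, m)
      // Commutative, associative aggregation per destination vertex
      val inbox: Map[Id, M] =
        msgs.groupBy(_._1).map { case (id, ms) => (id, ms.map(_._2).reduce(gather)) }
      // Left join the messages and run the vertex program on every vertex
      g = g.map { case (id, st) =>
        val next = vprog(id, st.value, inbox.get(id))
        (id, VState(next, halt = next == st.value)) // assumed halting rule
      }
      iter += 1
    }
    g.map { case (id, st) => (id, st.value) }
  }
}
```

As a usage example, propagating the maximum value along the edges 1 → 2 → 3 uses a vprog that takes the larger of the current value and the incoming message, a sendMsg that sends the source value only when it exceeds the destination value, and math.max as gather.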


Distributed Representation of a Graph


Distributed Graph Representation
• GraphX represents graphs internally as a pair of vertex and edge collections built on the Spark RDD abstraction
• Indexing and graph-specific partitioning are added as a layer on top of RDDs

[Figure: distributed representation of a graph with vertices 1 to 6]
Edges
  Edge partition A: (1, 2), (1, 3)
  Edge partition B: (4, 1), (4, 5)
  Edge partition C: (1, 5), (1, 6), (5, 6)
Vertices (bitmask value in parentheses)
  Vertex partition A: 1 (1), 2 (1), 3 (1)
  Vertex partition B: 4 (1), 5 (1), 6 (0)
Routing table
  Partition A: A → 1, 2, 3; B → 1; C → 1
  Partition B: A → (none); B → 4, 5; C → 5, 6


Vertices and Edges
• The vertex collection is hash-partitioned by vertex id
• Vertices are stored in a local hash index within each partition
• A bitmask stores the visibility of each vertex
  • Soft deletions promote index reuse
  • If vertex 5 and its adjacent edges are restricted from the graph, they are removed from the corresponding collections by updating the bitmasks
  • Your computation can reuse this index
• Edges are divided into three edge partitions by applying a partition function
  • E.g. 2D partitioning
• Vertices are partitioned by vertex id

[Figure: edge partitions A, B, and C; vertex partition A (vertices 1, 2, 3; bitmask 1, 1, 1) and vertex partition B (vertices 4, 5, 6; bitmask 1, 1, 0)]
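The soft-deletion idea can be illustrated with a small sketch in which restricting a partition only rewrites the bitmask and leaves the hash index untouched. The VertexPartition layout below is an assumption for the example and is much simpler than the structures GraphX uses.

```scala
object SoftDeleteIndex {
  type Id = Long

  // One vertex partition: a local hash index (id -> slot), a property array,
  // and a visibility bitmask with one entry per slot
  final case class VertexPartition[V](index: Map[Id, Int],
                                      props: Vector[V],
                                      visible: Vector[Boolean]) {
    // Restrict the partition without rebuilding the index: only the bitmask changes
    def filter(pred: (Id, V) => Boolean): VertexPartition[V] = {
      val mask = index.foldLeft(visible) { case (m, (id, slot)) =>
        m.updated(slot, m(slot) && pred(id, props(slot)))
      }
      copy(visible = mask)
    }

    // Look up a vertex through the shared index, honoring the bitmask
    def get(id: Id): Option[V] =
      index.get(id).filter(slot => visible(slot)).map(slot => props(slot))
  }
}
```

After filtering out vertex 5, the restricted partition still refers to the very same index object, which is what allows later joins to reuse it.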


Routing table
• Encodes, for each vertex, the edge partitions that contain its adjacent edges
• Join-site information is stored in the routing table

[Figure: routing table]
  Partition A: A → 1, 2, 3; B → 1; C → 1
  Partition B: A → (none); B → 4, 5; C → 5, 6


Graph Partitioning: EdgePartition2D
• Inspired by multilevel k-way partitioning [1]
• 2D graph partitioning
  • Upper bound of $2\sqrt{n} - 1$ on the vertex replication factor, where $n$ is the number of partitions

[1] Karypis, G., and Kumar, V. Multilevel k-way partitioning scheme for irregular graphs. J. Parallel Distrib. Comput. 48, 1 (1998), 96–129.


Graph Partitioning: EdgePartition2D
• Consider a graph G = (V, E), where V is the set of vertices and E is the set of edges
  • Every vertex in V has a vertex identifier and a vertex property
  • Every edge in E has source and destination vertex identifiers and an edge property
• Goal
  • Create n partitions of G such that:
    • The partitions incur minimum communication
    • The workload is balanced


Step 1: Creating a partition table

If n is a perfect square:
  rows (# of rows) = $\sqrt{n}$
  cols (# of columns) = $\sqrt{n}$
If n is not a perfect square:
  cols = $\lceil \sqrt{n} \rceil$
  rows = $\lfloor (n + \text{cols} - 1) / \text{cols} \rfloor$

For example, if n = 27, cols = 6 and rows = 5
The last column would have n − rows × (cols − 1) = 27 − 25 = 2 rows

[Figure: partition table with $\sqrt{n}$ rows and $\sqrt{n}$ columns]


Step 2: Assigning vertices and edges

Vertex assignment
• Uses an elementary modular hash: v % n
• Vertices are equally distributed among the partitions

Edge assignment
• The source vertex (src) is mapped onto the columns
  • col = (src × mixingPrime) % $\sqrt{n}$, if n is a perfect square
  • col = ((src × mixingPrime) % n) / rows, otherwise (integer division)
  • mixingPrime is a large prime number that improves the balance of the edge distribution
• The destination vertex (des) is mapped onto the rows
  • row = (des × mixingPrime) % $\sqrt{n}$, if n is a perfect square
  • row = (des × mixingPrime) % rows, if n is not a perfect square and col < cols − 1
  • row = (des × mixingPrime) % lastColRows, otherwise


Step 3: Storing edge properties

Storing edge properties
• The edge is stored in partition (col × $\sqrt{n}$ + row) if n is a perfect square
• The edge is stored in partition (col × rows + row) otherwise
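The three steps can be put together in a short sketch. It follows the logic of Spark GraphX's EdgePartition2D strategy as far as the author of this example understands it. The object name, the dims helper, and passing mixingPrime as a parameter (so that the small worked examples can use mixingPrime = 3) are assumptions made here.

```scala
object EdgePartition2DSketch {
  // Step 1: dimensions of the partition table for n partitions,
  // returned as (cols, rows, lastColRows)
  def dims(n: Int): (Int, Int, Int) = {
    val cols = math.ceil(math.sqrt(n.toDouble)).toInt
    val rows = (n + cols - 1) / cols        // integer division
    val lastColRows = n - rows * (cols - 1) // number of rows in the last column
    (cols, rows, lastColRows)
  }

  // Steps 2 and 3: map an edge (src, dst) to a partition number
  def partition(src: Long, dst: Long, n: Int, mixingPrime: Long): Int = {
    val (cols, rows, lastColRows) = dims(n)
    val s = math.abs(src * mixingPrime)
    val d = math.abs(dst * mixingPrime)
    if (n == cols * cols) {
      // Perfect square: a cols x cols table
      val col = (s % cols).toInt
      val row = (d % cols).toInt
      col * cols + row
    } else {
      // Otherwise the last column may be shorter than the others
      val col = ((s % n) / rows).toInt
      val row = (d % (if (col < cols - 1) rows else lastColRows)).toInt
      col * rows + row
    }
  }
}
```

With n = 25 and mixingPrime = 3, the edge (9, 2) lands in column 2 and row 1, that is partition 2 × 5 + 1 = 11, and every edge whose source is 9 stays in column 2.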



Discussions
• Let’s locate a set of edges using EdgePartition2D
• {(s, d1), (s, d2), (s, d3), (s, d4), (s, d5)} (sharing the same source vertex)
• Where will they be located?
  a. a single cell
  b. a single row
  c. a single column
  d. randomly dispersed


Understanding the effect of EdgePartition2D
• Let’s locate an edge (vsrc, vdes)
• All the edges where vsrc is the source vertex would be placed in the same column, col
  • Example: if vsrc = 9 and mixingPrime = 3 for n = 25 partitions
    • col = (9 × 3) % 5 = 2
• The actual cell is determined by the destination vertex
  • If vdes = 2 and mixingPrime = 3
    • row = (2 × 3) % 5 = 1
• Therefore, the edge (vsrc, vdes) is stored in partition 11 = 2 × 5 + 1 (the cell in the 2nd row and the 3rd column)

[Figure: partition table with $\sqrt{n}$ rows and $\sqrt{n}$ columns, indexed 0 to 4]


Understanding the effect of EdgePartition2D
• A vertex with vertex id v can be in any of the cells in column (v × mixingPrime) % $\sqrt{n}$
  • If it is a source vertex
• Similarly, a vertex with vertex id v can be in any of the cells in row (v × mixingPrime) % $\sqrt{n}$
  • If it is a destination vertex
• Can a vertex v be in any cells other than the aforementioned set of cells?
  • No!

[Figure: partition table with $\sqrt{n}$ rows and $\sqrt{n}$ columns, indexed 0 to 4]


Understanding the effect of EdgePartition2D
• Therefore, any edge containing v has to be placed in one of $\sqrt{n} + \sqrt{n} - 1 = 2\sqrt{n} - 1$ partitions
• The upper bound on the vertex replication factor is $2\sqrt{n} - 1$
  • This is directly related to the communication cost of synchronizing the state of the vertex properties

[Figure: partition table with $\sqrt{n}$ rows and $\sqrt{n}$ columns, indexed 0 to 4]
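The bound can be checked empirically for the perfect-square case with a short sketch. The object and function names are assumptions. The partition function repeats the perfect-square rule from the edge-assignment step, with mixingPrime left as a parameter.

```scala
object ReplicationBound {
  // Perfect-square case only: n = side * side partitions
  def partition(src: Long, dst: Long, side: Int, mixingPrime: Long): Int = {
    val col = (math.abs(src * mixingPrime) % side).toInt
    val row = (math.abs(dst * mixingPrime) % side).toInt
    col * side + row
  }

  // Every partition that can hold an edge touching v, with v as source or as destination
  def partitionsOf(v: Long, others: Seq[Long], side: Int, mixingPrime: Long): Set[Int] =
    others.flatMap { u =>
      Seq(partition(v, u, side, mixingPrime), partition(u, v, side, mixingPrime))
    }.toSet
}
```

For v = 9, side = 5, and mixingPrime = 3, connecting v to the ids 0 to 99 touches one full column and one full row that share a single cell, so the set has 5 + 5 − 1 = 9 elements.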

• Naman Shah, Matthew Malensek, Harshil Shah, Shrideep Pallickara, and Sangmi Lee Pallickara, “Scalable Network Analytics for Characterization of Outbreak Influence in Voluminous Epidemiology Datasets,” Concurrency and Computation: Practice & Experience. John Wiley. 2018
• Naman Shah, Harshil Shah, Matthew Malensek, Sangmi Lee Pallickara, and Shrideep Pallickara. “Network Analysis for Identifying and Characterizing Disease Outbreak Influence from Voluminous Epidemiology Data,” Proceedings of the IEEE International Conference on Big Data (IEEE BigData). Washington D.C., USA. 2016
