Large Scale Graph Processing - X-Stream and GraphX
Where Are We?
Graph Algorithms Challenges
- Difficult to extract parallelism based on partitioning of the data.
- Difficult to express parallelism based on partitioning of computation.
Think Like an Edge
Motivation
Could we compute big graphs on a single machine?
Vertex-Centric
- Vertex-centric gather-scatter: iterates over vertices
Until convergence {
  // the scatter phase
  for all vertices v that need to scatter updates
    send updates over outgoing edges of v
  // the gather phase
  for all vertices v that have updates
    apply updates from inbound edges of v
}
Vertex-Centric Breadth First Search
Until convergence {
  // the scatter phase
  for all vertices v that need to scatter updates
    send updates over outgoing edges of v
  // the gather phase
  for all vertices v that have updates
    apply updates from inbound edges of v
}
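The loop above can be sketched for breadth-first search roughly as follows. This is a minimal in-memory illustration, not code from the slides: the function name `vertex_centric_bfs`, the edge-list input format, and the sample graph are all assumptions made for the example.

```python
from collections import defaultdict


def vertex_centric_bfs(edges, root):
    """Vertex-centric scatter-gather BFS; returns {vertex: hop distance}."""
    # Index outgoing edges by source vertex. This per-vertex lookup is the
    # random access into the edge data that the vertex-centric model needs.
    out_edges = defaultdict(list)
    for src, dst in edges:
        out_edges[src].append(dst)

    dist = {root: 0}
    frontier = [root]  # vertices that need to scatter updates
    while frontier:    # "until convergence"
        # Scatter phase: each active vertex sends an update over its out-edges.
        updates = []
        for v in frontier:
            for dst in out_edges[v]:
                updates.append((dst, dist[v] + 1))
        # Gather phase: each vertex that received updates applies them.
        frontier = []
        for dst, d in updates:
            if dst not in dist:
                dist[dst] = d
                frontier.append(dst)
    return dist


# Hypothetical sample graph (not from the slides).
sample_edges = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5)]
print(vertex_centric_bfs(sample_edges, root=1))
```

Only the vertices in the current frontier scatter, so little work is wasted, but each of them must look up its own outgoing edges.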
X-Stream
- Could we process massive graphs on a single machine?
- X-Stream makes accesses to graph edges sequential.
- Edge-centric scatter-gather model.
- Disk-based processing
  - Graph traversal = random access
  - Random access is inefficient for storage
Eiko Y., and Roy A., “Scale-up Graph Processing: A Storage-centric View”, 2013.
Vertex-Centric vs. Edge-Centric Programming Model (1/2)
- Vertex-centric gather-scatter: iterates over vertices
- Edge-centric gather-scatter: iterates over edges
Vertex-Centric vs. Edge-Centric Programming Model (2/2)
Vertex-centric:
Until convergence {
  // the scatter phase
  for all vertices v that need to scatter updates
    send updates over outgoing edges of v
  // the gather phase
  for all vertices v that have updates
    apply updates from inbound edges of v
}
Edge-centric:
Until convergence {
  // the scatter phase
  for all edges e
    send update over e
  // the gather phase
  for all edges e that have updates
    apply update to e.destination
}
Edge-Centric Breadth First Search
Until convergence {
  // the scatter phase
  for all edges e
    send update over e
  // the gather phase
  for all edges e that have updates
    apply update to e.destination
}
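An edge-centric version of the same search might look like the sketch below. It is an in-memory illustration under assumed names (`edge_centric_bfs`, a plain edge list, a made-up sample graph), not X-Stream's actual implementation. The point it shows is that every iteration streams over the whole edge list in order and never looks up the edges of a particular vertex.

```python
def edge_centric_bfs(edges, root):
    """Edge-centric scatter-gather BFS.

    Returns ({vertex: hop distance}, number of scatter-gather iterations).
    """
    dist = {root: 0}
    iterations = 0
    changed = True
    while changed:  # "until convergence"
        iterations += 1
        # Scatter phase: one sequential pass over ALL edges; an edge emits an
        # update only if its source vertex has already been reached.
        updates = [(dst, dist[src] + 1) for src, dst in edges if src in dist]
        # Gather phase: one sequential pass over the updates; each update is
        # applied to the destination vertex of the edge that produced it.
        changed = False
        for dst, d in updates:
            if dst not in dist or d < dist[dst]:
                dist[dst] = d
                changed = True
    return dist, iterations


# Hypothetical sample graph (not from the slides).
sample_edges = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5)]
distances, rounds = edge_centric_bfs(sample_edges, root=1)
print(distances, rounds)
```

The edge list is rescanned in full on every iteration, including the final one that only confirms convergence, so the approach pays off when sequential scans are much cheaper than random lookups.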
Vertex-Centric vs. Edge-Centric Tradeoff
- Vertex-centric scatter-gather: $\frac{\text{EdgeData}}{\text{RandomAccessBandwidth}}$
- Edge-centric scatter-gather: $\frac{\text{Scatters} \times \text{EdgeData}}{\text{SequentialAccessBandwidth}}$
- Sequential access bandwidth $\gg$ random access bandwidth.
- Few scatter-gather iterations for real-world graphs.
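To see how the two cost expressions above compare, here is a small worked calculation. All numbers (edge data size, bandwidths, number of scatter iterations) are hypothetical values chosen only for illustration; they are not measurements from the slides or from X-Stream.

```python
def vertex_centric_time(edge_data_bytes, random_bw):
    """EdgeData / RandomAccessBandwidth, in seconds."""
    return edge_data_bytes / random_bw


def edge_centric_time(scatters, edge_data_bytes, sequential_bw):
    """Scatters * EdgeData / SequentialAccessBandwidth, in seconds."""
    return scatters * edge_data_bytes / sequential_bw


# Hypothetical figures, for illustration only.
edge_data = 64 * 10**9        # 64 GB of edge data
random_bw = 2 * 10**6         # 2 MB/s effective random-access bandwidth
sequential_bw = 200 * 10**6   # 200 MB/s sequential bandwidth
scatters = 20                 # scatter-gather iterations until convergence

print(vertex_centric_time(edge_data, random_bw))               # 32000.0
print(edge_centric_time(scatters, edge_data, sequential_bw))   # 6400.0

# Edge-centric wins while Scatters < SequentialBW / RandomBW.
print(sequential_bw / random_bw)                               # 100.0
```

With these assumed values the edge-centric model stays ahead for up to 100 iterations, which is why a small iteration count on real-world graphs matters.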
Streaming Partitions (1/4)
- Problem: still have random access to vertex set.
- Solution: partition the graph into streaming partitions.
Streaming Partitions (2/4)
Streaming Partitions (3/4)
- A streaming partition consists of: a vertex set, an edge list, and an update list.
- The vertex set: a subset of the vertex set of the graph that fits into memory.
  - Vertex sets are mutually disjoint.
  - Their union equals the vertex set of the entire graph.
- The edge list: all edges whose source vertex is in the partition's vertex set.
- The update list: all updates whose destination vertex is in the partition's vertex set.
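The definition above can be made concrete with a small sketch. The function name `build_streaming_partitions`, the dictionary layout, and the assignment rule (vertex id modulo the number of partitions) are assumptions for illustration; the slides do not say how vertices are assigned to partitions.

```python
def build_streaming_partitions(vertices, edges, num_partitions):
    """Split a graph into streaming partitions.

    Each partition holds a vertex set, an edge list, and an (initially
    empty) update list. Assumption: integer vertex ids, assigned to
    partitions by id modulo num_partitions.
    """
    partitions = [
        {"vertices": set(), "edges": [], "updates": []}
        for _ in range(num_partitions)
    ]
    for v in vertices:
        partitions[v % num_partitions]["vertices"].add(v)
    for src, dst in edges:
        # An edge belongs to the partition that owns its SOURCE vertex.
        partitions[src % num_partitions]["edges"].append((src, dst))
    return partitions


# Hypothetical sample graph (not from the slides).
sample_vertices = [1, 2, 3, 4, 5]
sample_edges = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5)]
for p in build_streaming_partitions(sample_vertices, sample_edges, 2):
    print(sorted(p["vertices"]), p["edges"])
```

The update lists start empty because updates only exist while an iteration is running; they are filled by destination vertex during the shuffle.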
Streaming Partitions (4/4)
// scatter phase:
for each streaming_partition p
  read in vertex set of p
  for each edge e in edge list of p
    append update to Uout
// shuffle phase:
for each update u in Uout
  p = partition containing target of u
  append u to Uin(p)
destroy Uout
// gather phase:
for each streaming_partition p
  read in vertex set of p
  for each update u in Uin(p)
    edge_gather(u)
  destroy Uin(p)
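One way to read the three phases above is the following BFS sketch. It is a simplified in-memory model, not X-Stream itself: Python lists stand in for the on-disk edge and update streams, a single `dist` dictionary stands in for the per-partition vertex state, and the name `xstream_style_bfs` and the modulo partitioning rule are assumptions.

```python
def xstream_style_bfs(vertices, edges, root, k):
    """BFS using scatter, shuffle and gather over k streaming partitions."""
    part = lambda v: v % k  # assumed partitioning rule
    vertex_sets = [[v for v in vertices if part(v) == p] for p in range(k)]
    edge_lists = [[e for e in edges if part(e[0]) == p] for p in range(k)]

    dist = {root: 0}
    changed = True
    while changed:
        changed = False
        # Scatter phase: per partition, load its vertex state, then stream
        # its edge list and append updates to Uout.
        u_out = []
        for p in range(k):
            state = {v: dist[v] for v in vertex_sets[p] if v in dist}
            for src, dst in edge_lists[p]:
                if src in state:
                    u_out.append((dst, state[src] + 1))
        # Shuffle phase: route each update to the partition owning its target.
        u_in = [[] for _ in range(k)]
        for dst, d in u_out:
            u_in[part(dst)].append((dst, d))
        u_out = []  # destroy Uout
        # Gather phase: per partition, stream its updates and apply them.
        for p in range(k):
            for dst, d in u_in[p]:
                if d < dist.get(dst, float("inf")):
                    dist[dst] = d
                    changed = True
            u_in[p] = []  # destroy Uin(p)
    return dist


# Hypothetical sample graph (not from the slides).
print(xstream_style_bfs([1, 2, 3, 4, 5],
                        [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5)],
                        root=1, k=2))
```

During scatter and gather only one partition's vertex state is touched at a time, which is what confines random vertex access to a set that fits in memory.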
Think Like a Table
Graph-Parallel Processing Model
Data-Parallel vs. Graph-Parallel Computation
Motivation (2/3)
- Graph-parallel computation: restricting the types of computation to achieve performance.
- The same restrictions make it difficult and inefficient to express many stages in a typical graph-analytics pipeline.
Motivation (3/3)
Think Like a Table
- Unifies data-parallel and graph-parallel systems.
- Tables and graphs are composable views of the same physical data.
GraphX
- GraphX is Spark's library for graph-parallel processing.
- In-memory caching.
- Lineage-based fault tolerance.
The Property Graph Data Model
- Spark represents graph-structured data as a property graph.
- It is logically represented as a pair of vertex and edge property collections.
  - VertexRDD and EdgeRDD
// VD: the type of the vertex attribute
// ED: the type of the edge attribute
class Graph[VD, ED] {
  val vertices: VertexRDD[VD]
  val edges: EdgeRDD[ED]
}
The Vertex Collection
- VertexRDD: contains the vertex properties keyed by the vertex ID.
class Graph[VD, ED] {
  val vertices: VertexRDD[VD]
  val edges: EdgeRDD[ED]
}
// VD: the type of the vertex attribute
abstract class VertexRDD[VD] extends RDD[(VertexId, VD)]
The Edge Collection
- EdgeRDD: contains the edge properties keyed by the source and destination vertex IDs.
class Graph[VD, ED] {
  val vertices: VertexRDD[VD]
  val edges: EdgeRDD[ED]
}
// ED: the type of the edge attribute
case class Edge[ED](srcId: VertexId, dstId: VertexId, attr: ED)
abstract class EdgeRDD[ED] extends RDD[Edge[ED]]
The Triplet Collection
- The triplets collection consists of each edge and its corresponding source and destination vertex properties.
- It logically joins the vertex and edge properties: RDD[EdgeTriplet[VD, ED]].
- The EdgeTriplet class extends the Edge class by adding the srcAttr and dstAttr members, which contain the source and destination properties respectively.
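The logical join described above can be imitated in plain Python to show what a triplet carries. This is only an analogy, not the GraphX API: the `Edge` and `EdgeTriplet` tuples and the `triplets` function are hypothetical stand-ins, and a dictionary plays the role of the vertex collection.

```python
from typing import Any, NamedTuple


class Edge(NamedTuple):
    src_id: int
    dst_id: int
    attr: Any


class EdgeTriplet(NamedTuple):
    src_id: int
    dst_id: int
    attr: Any
    src_attr: Any  # property of the source vertex
    dst_attr: Any  # property of the destination vertex


def triplets(vertices, edges):
    """Join each edge with the properties of its two endpoint vertices."""
    return [
        EdgeTriplet(e.src_id, e.dst_id, e.attr,
                    vertices[e.src_id], vertices[e.dst_id])
        for e in edges
    ]


# Vertex table keyed by vertex id, and an edge table.
users = {3: ("rxin", "student"), 7: ("jgonzal", "postdoc"),
         5: ("franklin", "prof"), 2: ("istoica", "prof")}
relationships = [Edge(3, 7, "collab")]

for t in triplets(users, relationships):
    print(t.src_attr[0], "is the", t.attr, "of", t.dst_attr[0])
```

Treating the vertices and edges as two tables and the triplets as their join is the "think like a table" view: the graph is a derived view over the same data.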
Building a Property Graph
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
// Vertices: (id, (name, position)); sc is an existing SparkContext (e.g., in spark-shell)
val users: RDD[(VertexId, (String, String))] = sc.parallelize(Array(
  (3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
  (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))
// Edges; vertex 1L has no entry in users, so it receives defaultUser
val relationships: RDD[Edge[String]] = sc.parallelize(Array(
  Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), Edge(2L, 5L, "colleague"),
  Edge(5L, 7L, "pi"), Edge(5L, 1L, "-")))
val defaultUser = ("John Doe", "Missing")
val graph: Graph[(String, String), String] = Graph(users, relationships, defaultUser)
Graph Operators
I Information about the graph
I Property operators
I Structural operators
I Joins
I Aggregation
I Iterative computation
I ...
Information About The Graph (1/2)
I Information about the graph
val numEdges: Long
val numVertices: Long
val inDegrees: VertexRDD[Int]
val outDegrees: VertexRDD[Int]
val degrees: VertexRDD[Int]
I Views of the graph as collections
val vertices: VertexRDD[VD]
val edges: EdgeRDD[ED]
val triplets: RDD[EdgeTriplet[VD, ED]]
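As a rough illustration of what the degree views contain, the sketch below computes them from a plain edge list without Spark. The `countBy` helper and the object name are hypothetical; the edge list mirrors the property-graph example in this deck. As far as I know, GraphX likewise omits vertices with zero degree from these views, but treat that as an assumption to check against the API documentation.

```scala
object DegreeSketch {
  // (srcId, dstId) pairs of the example property graph
  val edges: Seq[(Long, Long)] = Seq((3L, 7L), (5L, 3L), (2L, 5L), (5L, 7L), (5L, 1L))

  // Count how often each vertex id occurs in a sequence
  def countBy(ids: Seq[Long]): Map[Long, Int] =
    ids.groupBy(identity).map { case (id, xs) => id -> xs.size }

  val outDegrees: Map[Long, Int] = countBy(edges.map(_._1))
  val inDegrees: Map[Long, Int] = countBy(edges.map(_._2))
  val degrees: Map[Long, Int] = countBy(edges.flatMap { case (s, d) => Seq(s, d) })
}
```

Vertex 7 has no outgoing edge, so it does not appear in `outDegrees` at all in this sketch.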
Information About The Graph (2/2)
// graph: Graph[(String, String), String], as constructed above
// Count all users who are postdocs
graph.vertices.filter { case (id, (name, pos)) => pos == "postdoc" }.count
// Count all the edges where srcId > dstId
graph.edges.filter(e => e.srcId > e.dstId).count
Property Operators
I Transform vertex and edge attributes
I Each of these operators yields a new graph with the vertex or edge properties modified by the user-defined map function.
def mapVertices[VD2](map: (VertexId, VD) => VD2): Graph[VD2, ED]
def mapEdges[ED2](map: Edge[ED] => ED2): Graph[VD, ED2]
def mapTriplets[ED2](map: EdgeTriplet[VD, ED] => ED2): Graph[VD, ED2]
val relations: RDD[String] = graph.triplets.map(triplet =>
triplet.srcAttr._1 + " is the " + triplet.attr + " of " + triplet.dstAttr._1)
relations.collect.foreach(println)
val newGraph = graph.mapTriplets(triplet =>
triplet.srcAttr._1 + " is the " + triplet.attr + " of " + triplet.dstAttr._1)
newGraph.edges.collect.foreach(println)
Structural Operators
I reverse returns a new graph with all the edge directions reversed.
I subgraph takes vertex/edge predicates and returns the graph containing only the vertices/edges that satisfy the given predicate.
def reverse: Graph[VD, ED]
def subgraph(epred: EdgeTriplet[VD, ED] => Boolean, vpred: (VertexId, VD) => Boolean):
Graph[VD, ED]
// Remove missing vertices as well as the edges connected to them
val validGraph = graph.subgraph(vpred = (id, attr) => attr._2 != "Missing")
validGraph.vertices.collect.foreach(println)
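The effect of subgraph can be sketched with plain Scala collections. This is an illustration under assumptions, not the GraphX implementation: the `subgraph` helper below is hypothetical, and its edge predicate sees a simplified `Edge` rather than an `EdgeTriplet`. It keeps a vertex only if vpred holds, and an edge only if epred holds and both endpoints were kept.

```scala
object SubgraphSketch {
  type VertexId = Long
  final case class Edge[ED](srcId: VertexId, dstId: VertexId, attr: ED)

  def subgraph[VD, ED](vertices: Map[VertexId, VD], edges: Seq[Edge[ED]])(
      vpred: (VertexId, VD) => Boolean,
      epred: Edge[ED] => Boolean): (Map[VertexId, VD], Seq[Edge[ED]]) = {
    val keptVertices = vertices.filter { case (id, attr) => vpred(id, attr) }
    // An edge survives only if it passes epred and both endpoints survived
    val keptEdges = edges.filter(e =>
      keptVertices.contains(e.srcId) && keptVertices.contains(e.dstId) && epred(e))
    (keptVertices, keptEdges)
  }

  val vertices: Map[VertexId, (String, String)] = Map(
    (3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
    (5L, ("franklin", "prof")), (2L, ("istoica", "prof")),
    (1L, ("John Doe", "Missing"))) // vertex 1 carries the default attribute
  val edges: Seq[Edge[String]] = Seq(
    Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), Edge(2L, 5L, "colleague"),
    Edge(5L, 7L, "pi"), Edge(5L, 1L, "-"))

  // Drop "Missing" vertices; accept every edge whose endpoints remain
  val valid = subgraph(vertices, edges)((id, attr) => attr._2 != "Missing", _ => true)
}
```

In this sketch vertex 1 is dropped, and the edge from 5 to 1 disappears with it even though the edge predicate accepts everything.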
Join Operators
I joinVertices joins the vertices with the input RDD.
• Returns a new graph with the vertex properties obtained by applying the user-defined map function to the result of the joined vertices.
• Vertices without a matching value in the RDD retain their original value.
def joinVertices[U](table: RDD[(VertexId, U)])(map: (VertexId, VD, U) => VD): Graph[VD, ED]
val rdd: RDD[(VertexId, String)] = sc.parallelize(Array((3L, "phd")))
val joinedGraph = graph.joinVertices(rdd)((id, user, role) => (user._1, role + " " + user._2))
joinedGraph.vertices.collect.foreach(println)
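The join semantics described above can be mimicked on plain Scala maps. The sketch below is a hypothetical model, not the GraphX code path: the local `joinVertices` function only mirrors the shape of the real signature, and the vertex data repeats the users from the property-graph example.

```scala
object JoinVerticesSketch {
  type VertexId = Long

  // Vertices with a match in `table` are rewritten by `f`; all others keep their value
  def joinVertices[VD, U](vertices: Map[VertexId, VD], table: Map[VertexId, U])(
      f: (VertexId, VD, U) => VD): Map[VertexId, VD] =
    vertices.map { case (id, attr) =>
      id -> table.get(id).map(u => f(id, attr, u)).getOrElse(attr)
    }

  val users: Map[VertexId, (String, String)] = Map(
    (3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
    (5L, ("franklin", "prof")), (2L, ("istoica", "prof")))

  // Only vertex 3 has a matching entry, so only its attribute changes
  val joined: Map[VertexId, (String, String)] =
    joinVertices(users, Map((3L, "phd"))) { (id, user, role) =>
      (user._1, role + " " + user._2)
    }
}
```

Here vertex 3 becomes ("rxin", "phd student"), while the other three vertices are returned unchanged.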
Aggregation (1/2)
I aggregateMessages applies a user-defined sendMsg function to each edge triplet in the graph and then uses the mergeMsg function to aggregate those messages at their destination vertex.
def aggregateMessages[Msg: ClassTag](
sendMsg: EdgeContext[VD, ED, Msg] => Unit, // map
mergeMsg: (Msg, Msg) => Msg, // reduce
tripletFields: TripletFields = TripletFields.All):
VertexRDD[Msg]
Aggregation (2/2)
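// Count and list the names of the friends of each user
// (validGraph is the subgraph built on the Structural Operators slide)
val profs: VertexRDD[(Int, String)] = validGraph.aggregateMessages[(Int, String)](
  // map: send (1, source name) to the destination vertex
  triplet => {
    triplet.sendToDst((1, triplet.srcAttr._1))
  },
  // reduce: add the counts and concatenate the names
  (a, b) => (a._1 + b._1, a._2 + " " + b._2)
)
profs.collect.foreach(println)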
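The map and reduce phases of this example can be sketched on plain Scala collections. This is a simplified, Spark-free model: `Triplet` and `aggregate` are hypothetical stand-ins for `EdgeContext` and `aggregateMessages`, and the triplets are the four edges that remain once the "Missing" vertex is removed.

```scala
object AggregateMessagesSketch {
  // Simplified stand-in for an edge triplet (hypothetical type)
  final case class Triplet(srcId: Long, srcName: String, dstId: Long, dstName: String)

  // Apply sendMsg to every triplet, then merge the messages per target vertex
  def aggregate[Msg](triplets: Seq[Triplet])(
      sendMsg: Triplet => Seq[(Long, Msg)],
      mergeMsg: (Msg, Msg) => Msg): Map[Long, Msg] =
    triplets.flatMap(sendMsg)
      .groupBy(_._1)
      .map { case (id, msgs) => id -> msgs.map(_._2).reduce(mergeMsg) }

  val triplets: Seq[Triplet] = Seq(
    Triplet(3L, "rxin", 7L, "jgonzal"),
    Triplet(5L, "franklin", 3L, "rxin"),
    Triplet(2L, "istoica", 5L, "franklin"),
    Triplet(5L, "franklin", 7L, "jgonzal"))

  // Each edge sends (1, source name) to its destination; counts add up, names concatenate
  val incoming: Map[Long, (Int, String)] =
    aggregate[(Int, String)](triplets)(
      t => Seq((t.dstId, (1, t.srcName))),
      (a, b) => (a._1 + b._1, a._2 + " " + b._2))
}
```

Vertices that receive no message, such as vertex 2 here, do not appear in the result. In a distributed run the merge order is not fixed, so a non-commutative mergeMsg like the string concatenation above may list the names in a different order.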
Iterative Computation (1/9)
Iterative Computation (2/9)
i_val := val
for each message m
  if m > val then val := m
if i_val == val then
  vote_to_halt
else
  for each neighbor v
    send_message(v, val)
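The vertex program above can be simulated with synchronous supersteps in plain Scala. The sketch makes two assumptions that the pseudocode leaves open: every vertex announces its value in superstep 0, and a halted vertex wakes up again when a message arrives. The graph, the `run` function and the `deliver` helper are hypothetical and exist only for this illustration.

```scala
object MaxPropagationSketch {
  // Hypothetical directed graph: vertex id -> out-neighbors
  val neighbors: Map[Int, Seq[Int]] =
    Map(1 -> Seq(2), 2 -> Seq(1, 6), 3 -> Seq(6), 6 -> Seq(1, 3))

  // Route the current value of each sender to its out-neighbors, grouped by receiver
  private def deliver(senders: Seq[Int], values: Map[Int, Int]): Map[Int, Seq[Int]] =
    senders.flatMap(s => neighbors(s).map(dst => dst -> values(s)))
      .groupBy(_._1)
      .map { case (dst, ms) => dst -> ms.map(_._2) }

  def run(initial: Map[Int, Int]): Map[Int, Int] = {
    var values = initial
    // Superstep 0 (assumption): every vertex sends its value to its neighbors
    var inbox = deliver(values.keys.toSeq, values)
    while (inbox.nonEmpty) {
      val changed = scala.collection.mutable.ArrayBuffer.empty[Int]
      inbox.foreach { case (v, msgs) =>
        val newVal = math.max(values(v), msgs.max)
        if (newVal != values(v)) { // value grew: stay active and notify neighbors
          values = values.updated(v, newVal)
          changed += v
        } // otherwise the vertex votes to halt
      }
      inbox = deliver(changed.toSeq, values)
    }
    values
  }
}
```

The loop ends once a superstep produces no messages, which is the point where all vertices have voted to halt. For example, `MaxPropagationSketch.run(Map(1 -> 1, 2 -> 2, 3 -> 3, 6 -> 6))` leaves every vertex holding 6 on this graph.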
Iterative Computation (6/9)
I pregel takes two argument lists: graph.pregel(list1)(list2).
I The first list contains configuration parameters.
• The initial message, the maximum number of iterations, and the edge direction in which to send messages.
I The second list contains the user-defined functions.
• Gather: mergeMsg, Apply: vprog, Scatter: sendMsg
def pregel[A]
(initialMsg: A, maxIter: Int = Int.MaxValue, activeDir: EdgeDirection = EdgeDirection.Out)
(vprog: (VertexId, VD, A) => VD, sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
mergeMsg: (A, A) => A):
Graph[VD, ED]
Iterative Computation (7/9)
import org.apache.spark._
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
val initialMsg = -9999
// (vertexID, (new vertex value, old vertex value))
val vertices: RDD[(VertexId, (Int, Int))] = sc.parallelize(Array((1L, (1, -1)),
(2L, (2, -1)), (3L, (3, -1)), (6L, (6, -1))))
val relationships: RDD[Edge[Boolean]] = sc.parallelize(Array(Edge(1L, 2L, true),
Edge(2L, 1L, true), Edge(2L, 6L, true), Edge(3L, 6L, true), Edge(6L, 1L, true),
Edge(6L, 3L, true)))
val graph = Graph(vertices, relationships)
Iterative Computation (8/9)
// Gather: the function for combining messages
def mergeMsg(msg1: Int, msg2: Int): Int = math.max(msg1, msg2)
// Apply: the function for receiving messages
def vprog(vertexId: VertexId, value: (Int, Int), message: Int): (Int, Int) = {
  if (message == initialMsg) // superstep 0
    value
  else // superstep > 0
    (math.max(message, value._1), value._1) // return (newValue, oldValue)
}
// Scatter: the function for computing messages
def sendMsg(triplet: EdgeTriplet[(Int, Int), Boolean]): Iterator[(VertexId, Int)] = {
  val sourceVertex = triplet.srcAttr
  if (sourceVertex._1 == sourceVertex._2) // newValue == oldValue for source vertex?
    Iterator.empty // do nothing
  else
    // propagate new (updated) value to the destination vertex
    Iterator((triplet.dstId, sourceVertex._1))
}
Iterative Computation (9/9)
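// The vertex program keeps the larger value, so the result holds propagated maxima
val maxGraph = graph.pregel(initialMsg,
  Int.MaxValue,
  EdgeDirection.Out)(
  vprog, // apply
  sendMsg, // scatter
  mergeMsg) // gather
// Print the final value of each vertex
maxGraph.vertices.collect.foreach {
  case (vertexId, (value, original_value)) => println(value)
}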
Graph Representation
I Vertex-cut partitioning
I Representing graphs using two RDDs: edge-collection and vertex-collection
I Routing table: a logical map from a vertex ID to the set of edge partitions that contain its adjacent edges.
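A routing table of this kind can be derived from the edge partitions alone. The sketch below uses plain Scala collections and a made-up two-partition layout; the names `edgePartitions` and `routingTable` are hypothetical, and the real GraphX data structures are more involved.

```scala
object RoutingTableSketch {
  type VertexId = Long
  type PartitionId = Int

  // Hypothetical vertex-cut layout: partition id -> (srcId, dstId) pairs stored there
  val edgePartitions: Map[PartitionId, Seq[(VertexId, VertexId)]] = Map(
    (0, Seq((1L, 2L), (2L, 6L))),
    (1, Seq((3L, 6L), (6L, 1L))))

  // For each vertex, collect the partitions that hold at least one adjacent edge
  val routingTable: Map[VertexId, Set[PartitionId]] =
    edgePartitions.toSeq
      .flatMap { case (pid, es) =>
        es.flatMap { case (src, dst) => Seq(src -> pid, dst -> pid) }
      }
      .groupBy(_._1)
      .map { case (vid, pairs) => vid -> pairs.map(_._2).toSet }
}
```

Vertices 1 and 6 are cut across both partitions in this layout, so their attributes would have to be shipped to two edge partitions, while vertices 2 and 3 each map to a single partition.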
Summary
I Think like an edge
• X-Stream: edge-centric GAS, streaming partitions
I Think like a table
• GraphX: unifies data-parallel and graph-parallel systems.
References
I A. Roy et al., “X-Stream: Edge-centric Graph Processing using Streaming Partitions”, ACM SOSP 2013.
I J. Gonzalez et al., “GraphX: Graph Processing in a Distributed Dataflow Framework”, OSDI 2014.
Questions?