![Page 1: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics](https://reader036.vdocuments.net/reader036/viewer/2022062521/56816711550346895ddb7b1d/html5/thumbnails/1.jpg)
GraphX: Unifying Data-Parallel and Graph-Parallel Analytics
Presented by Joseph Gonzalez
Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and Ion Stoica
Strata 2014
*These slides are best viewed in PowerPoint with animation.
![Page 2: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics](https://reader036.vdocuments.net/reader036/viewer/2022062521/56816711550346895ddb7b1d/html5/thumbnails/2.jpg)
Graphs are Central to Analytics
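[Figure: Wikipedia analytics pipeline. Raw Wikipedia (XML) is parsed into a Text Table (Title, Body), Hyperlinks, and a Discussion Table (User, Disc.). Hyperlinks → PageRank → Top 20 Pages (Title, PR). Text Table → Term-Doc Graph → Topic Model (LDA) → Word Topics (Word, Topic). Discussion Table → Editor Graph → Community Detection → User Community (User, Com.). Community Topic (Topic, Com.).]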
![Page 3: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics](https://reader036.vdocuments.net/reader036/viewer/2022062521/56816711550346895ddb7b1d/html5/thumbnails/3.jpg)
PageRank: Identifying Leaders
Rank of user i = weighted sum of neighbors’ ranks
Update ranks in parallel
Iterate until convergence
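The update rule on this slide can be sketched with plain Scala collections. This is only an illustration, not GraphX code: the reset value 0.15, the edge weights 1/outDegree, the tolerance, and all names (`PageRankSketch`, `step`, `run`) are assumptions that do not appear in the slides.

```scala
object PageRankSketch {
  // Directed edge list: (src, dst). Hypothetical toy representation.
  type Edge = (Int, Int)

  // One synchronous sweep: every rank is recomputed from the previous ranks.
  def step(ranks: Map[Int, Double], edges: Seq[Edge], reset: Double): Map[Int, Double] = {
    val outDeg: Map[Int, Int] = edges.groupBy(_._1).map { case (v, es) => v -> es.size }
    // Weighted contribution along each edge: rank(src) / outDegree(src) (assumed weighting).
    val contribs: Map[Int, Double] = edges
      .map { case (src, dst) => dst -> ranks(src) / outDeg(src) }
      .groupBy(_._1)
      .map { case (v, cs) => v -> cs.map(_._2).sum }
    ranks.map { case (v, _) => v -> (reset + (1.0 - reset) * contribs.getOrElse(v, 0.0)) }
  }

  // Iterate until the largest per-vertex change drops below tol, or maxIter is reached.
  def run(vertices: Set[Int], edges: Seq[Edge], reset: Double = 0.15,
          tol: Double = 1e-6, maxIter: Int = 100): Map[Int, Double] = {
    var ranks: Map[Int, Double] = vertices.map(v => v -> 1.0).toMap
    var delta = Double.MaxValue
    var iter = 0
    while (delta > tol && iter < maxIter) {
      val next = step(ranks, edges, reset)
      delta = next.foldLeft(0.0) { case (m, (v, r)) => math.max(m, math.abs(r - ranks(v))) }
      ranks = next
      iter += 1
    }
    ranks
  }
}
```

Each sweep reads only the previous ranks, so all vertices can be updated in parallel, which is the property the slide highlights.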
![Page 4: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics](https://reader036.vdocuments.net/reader036/viewer/2022062521/56816711550346895ddb7b1d/html5/thumbnails/4.jpg)
The Graph-Parallel Pattern
Model / Alg. State
Computation depends only on the neighbors
![Page 5: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics](https://reader036.vdocuments.net/reader036/viewer/2022062521/56816711550346895ddb7b1d/html5/thumbnails/5.jpg)
Many Graph-Parallel Algorithms
• Collaborative Filtering
– Alternating Least Squares
– Stochastic Gradient Descent
– Tensor Factorization
• Structured Prediction
– Loopy Belief Propagation
– Max-Product Linear Programs
– Gibbs Sampling
• Semi-supervised ML
– Graph SSL
– CoEM
• Community Detection
– Triangle-Counting
– K-core Decomposition
– K-Truss
• Graph Analytics
– PageRank
– Personalized PageRank
– Shortest Path
– Graph Coloring
• Classification
– Neural Networks
![Page 6: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics](https://reader036.vdocuments.net/reader036/viewer/2022062521/56816711550346895ddb7b1d/html5/thumbnails/6.jpg)
Graph-Parallel Systems
Pregel (Google)
Expose specialized APIs to simplify graph programming.
Exploit graph structure to achieve orders-of-magnitude performance gains over more general data-parallel systems.
![Page 7: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics](https://reader036.vdocuments.net/reader036/viewer/2022062521/56816711550346895ddb7b1d/html5/thumbnails/7.jpg)
PageRank on the Live-Journal Graph
Runtime (in seconds, PageRank for 10 iterations):
GraphLab: 22
Naïve Spark: 354
Mahout/Hadoop: 1340
GraphLab is 60x faster than Hadoop
GraphLab is 16x faster than Spark
![Page 8: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics](https://reader036.vdocuments.net/reader036/viewer/2022062521/56816711550346895ddb7b1d/html5/thumbnails/8.jpg)
Graphs are Central to Analytics
[Figure: the Wikipedia analytics pipeline from slide 2, shown again.]
![Page 9: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics](https://reader036.vdocuments.net/reader036/viewer/2022062521/56816711550346895ddb7b1d/html5/thumbnails/9.jpg)
Separate Systems to Support Each View
[Figure: Table View (a Table of Rows producing a Result) next to Graph View (a Dependency Graph processed by Pregel).]
![Page 10: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics](https://reader036.vdocuments.net/reader036/viewer/2022062521/56816711550346895ddb7b1d/html5/thumbnails/10.jpg)
Having separate systems for each view is difficult to use and inefficient
![Page 11: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics](https://reader036.vdocuments.net/reader036/viewer/2022062521/56816711550346895ddb7b1d/html5/thumbnails/11.jpg)
Difficult to Program and Use
Users must learn, deploy, and manage multiple systems
Leads to brittle and often complex interfaces
![Page 12: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics](https://reader036.vdocuments.net/reader036/viewer/2022062521/56816711550346895ddb7b1d/html5/thumbnails/12.jpg)
Inefficient
Extensive data movement and duplication across the network and file system
[Figure: XML input handed from stage to stage through HDFS.]
Limited reuse of internal data structures across stages
![Page 13: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics](https://reader036.vdocuments.net/reader036/viewer/2022062521/56816711550346895ddb7b1d/html5/thumbnails/13.jpg)
Solution: The GraphX Unified Approach
Enabling users to easily and efficiently express the entire graph analytics pipeline
New API: Blurs the distinction between Tables and Graphs
New System: Combines Data-Parallel and Graph-Parallel Systems
![Page 14: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics](https://reader036.vdocuments.net/reader036/viewer/2022062521/56816711550346895ddb7b1d/html5/thumbnails/14.jpg)
Tables and Graphs are composable views of the same physical data
[Figure: Table View and Graph View over the GraphX Unified Representation.]
Each view has its own operators that exploit the semantics of the view to achieve efficient execution
![Page 15: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics](https://reader036.vdocuments.net/reader036/viewer/2022062521/56816711550346895ddb7b1d/html5/thumbnails/15.jpg)
View a Graph as a Table
Property Graph (vertices R, J, F, I)

Vertex Property Table
Id        Property (V)
rxin      (Stu., Berk.)
jegonzal  (PstDoc, Berk.)
franklin  (Prof., Berk.)
istoica   (Prof., Berk.)

Edge Property Table
SrcId     DstId     Property (E)
rxin      jegonzal  Friend
franklin  rxin      Advisor
istoica   franklin  Coworker
franklin  jegonzal  PI
![Page 16: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics](https://reader036.vdocuments.net/reader036/viewer/2022062521/56816711550346895ddb7b1d/html5/thumbnails/16.jpg)
Table Operators
Table (RDD) operators are inherited from Spark:
map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin,
reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip,
sample, take, first, partitionBy, mapWith, pipe, save, ...
![Page 17: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics](https://reader036.vdocuments.net/reader036/viewer/2022062521/56816711550346895ddb7b1d/html5/thumbnails/17.jpg)
Graph Operators
class Graph[V, E] {
  def Graph(vertices: Table[(Id, V)], edges: Table[(Id, Id, E)])
  // Table Views -----------------
  def vertices: Table[(Id, V)]
  def edges: Table[(Id, Id, E)]
  def triplets: Table[((Id, V), (Id, V), E)]
  // Transformations ------------------------------
  def reverse: Graph[V, E]
  def subgraph(pV: (Id, V) => Boolean,
               pE: Edge[V, E] => Boolean): Graph[V, E]
  def mapV(m: (Id, V) => T): Graph[T, E]
  def mapE(m: Edge[V, E] => T): Graph[V, T]
  // Joins ----------------------------------------
  def joinV(tbl: Table[(Id, T)]): Graph[(V, T), E]
  def joinE(tbl: Table[(Id, Id, T)]): Graph[V, (E, T)]
  // Computation ----------------------------------
  def mrTriplets(mapF: (Edge[V, E]) => List[(Id, T)],
                 reduceF: (T, T) => T): Graph[T, E]
}
![Page 18: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics](https://reader036.vdocuments.net/reader036/viewer/2022062521/56816711550346895ddb7b1d/html5/thumbnails/18.jpg)
Triplets Join Vertices and Edges
The triplets operator joins vertices and edges:
[Figure: Vertices (A, B, C, D) joined with Edges (A B, A C, B C, C D) produce one triplet per edge, carrying both endpoint vertices.]
The mrTriplets operator sums adjacent triplets.
SELECT t.dstId, reduceUDF( mapUDF(t) ) AS sum
FROM triplets AS t GROUP BY t.dstId
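The join and the SQL-style aggregation on this slide can be sketched with plain Scala collections. This is a minimal illustration rather than the GraphX implementation: `TripletsSketch`, `Triplet`, `triplets`, and `groupByDst` are hypothetical names, and string ids stand in for the slide's `Id` type.

```scala
object TripletsSketch {
  // Hypothetical row type for the triplets view: an edge plus both endpoint attributes.
  final case class Triplet[V, E](srcId: String, srcAttr: V, dstId: String, dstAttr: V, attr: E)

  // Join each edge with the vertex rows of its source and its destination.
  def triplets[V, E](vertices: Map[String, V], edges: Seq[(String, String, E)]): Seq[Triplet[V, E]] =
    for {
      (src, dst, e) <- edges
      s <- vertices.get(src).toSeq
      d <- vertices.get(dst).toSeq
    } yield Triplet(src, s, dst, d, e)

  // Mirrors the SQL on the slide: map every triplet, then reduce the mapped values per dstId.
  def groupByDst[V, E, T](ts: Seq[Triplet[V, E]])(mapUDF: Triplet[V, E] => T)(reduceUDF: (T, T) => T): Map[String, T] =
    ts.groupBy(_.dstId).map { case (dst, group) => dst -> group.map(mapUDF).reduce(reduceUDF) }
}
```

With the slide's edges (A B, A C, B C, C D), `groupByDst(ts)(_ => 1)(_ + _)` yields the in-degree of each destination vertex; vertices with no incoming edge do not appear in the result.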
![Page 19: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics](https://reader036.vdocuments.net/reader036/viewer/2022062521/56816711550346895ddb7b1d/html5/thumbnails/19.jpg)
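Map Reduce Triplets
Map-Reduce for each vertex
[Figure: on a graph with vertices A through F, mapF is applied to the triplets (A, B) and (A, C), producing messages A1 and A2; reduceF(A1, A2) combines them into the result for A.]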
![Page 20: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics](https://reader036.vdocuments.net/reader036/viewer/2022062521/56816711550346895ddb7b1d/html5/thumbnails/20.jpg)
Example: Oldest Follower
What is the age of the oldest follower for each user?
val oldestFollowerAge = graph
  .mrTriplets(
    e => (e.dst.id, e.src.age), // Map
    (a, b) => max(a, b)         // Reduce
  )
  .vertices
[Figure: example graph with vertices A through F annotated with ages 23, 42, 30, 19, 75, 16.]
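A self-contained version of this query, written against plain Scala collections instead of a GraphX graph, might look as follows. The types `Vertex` and `Edge`, the curried `mrTriplets` helper, and the convention that `src` follows `dst` are assumptions made for the sketch.

```scala
object OldestFollowerSketch {
  final case class Vertex(id: String, age: Int)
  final case class Edge(src: Vertex, dst: Vertex) // assumed meaning: src follows dst

  // Map each edge to a (vertexId, message) pair, then reduce the messages per vertex.
  def mrTriplets[T](edges: Seq[Edge])(mapF: Edge => (String, T))(reduceF: (T, T) => T): Map[String, T] =
    edges.map(mapF).groupBy(_._1).map { case (id, msgs) => id -> msgs.map(_._2).reduce(reduceF) }

  // Same map and reduce functions as on the slide.
  def oldestFollowerAge(edges: Seq[Edge]): Map[String, Int] =
    mrTriplets(edges)(e => (e.dst.id, e.src.age))((a, b) => math.max(a, b))
}
```

Users without followers receive no message and are therefore absent from the result map.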
![Page 21: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics](https://reader036.vdocuments.net/reader036/viewer/2022062521/56816711550346895ddb7b1d/html5/thumbnails/21.jpg)
We express the Pregel and GraphLab abstractions using the GraphX operators in less than 50 lines of code!
By composing these operators we can construct entire graph-analytics pipelines.
![Page 22: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics](https://reader036.vdocuments.net/reader036/viewer/2022062521/56816711550346895ddb7b1d/html5/thumbnails/22.jpg)
DIY Demo this Afternoon
![Page 23: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics](https://reader036.vdocuments.net/reader036/viewer/2022062521/56816711550346895ddb7b1d/html5/thumbnails/23.jpg)
GraphX System Design
![Page 24: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics](https://reader036.vdocuments.net/reader036/viewer/2022062521/56816711550346895ddb7b1d/html5/thumbnails/24.jpg)
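Distributed Graphs as Tables (RDDs)
[Figure: a Property Graph on vertices A through F is split by the 2D Vertex Cut Heuristic into Part. 1 and Part. 2. Edge Table (RDD): (A, B), (A, C), (C, D), (B, C), (A, E), (A, F), (E, F), (E, D). Vertex Table (RDD): A through F. Routing Table (RDD): for each vertex, the edge partitions (1, 2, or both) that reference it; A and D appear in both partitions.]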
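One way to picture a 2D vertex cut and the routing table it induces is the grid scheme below. It is a sketch under stated assumptions, not the heuristic GraphX ships: the partitions are arranged in a square grid, the source vertex picks the row, the destination vertex picks the column, and `numParts` must be a perfect square. All names are hypothetical.

```scala
object VertexCut2DSketch {
  // Assign an edge to a cell of a side x side grid of partitions.
  def partitionOf(src: Long, dst: Long, numParts: Int): Int = {
    val side = math.sqrt(numParts.toDouble).toInt
    require(side * side == numParts, "numParts must be a perfect square in this sketch")
    val row = (math.abs(src) % side).toInt // chosen by the source vertex
    val col = (math.abs(dst) % side).toInt // chosen by the destination vertex
    row * side + col
  }

  // Routing table: for each vertex, the set of edge partitions that reference it.
  def routingTable(edges: Seq[(Long, Long)], numParts: Int): Map[Long, Set[Int]] =
    edges
      .flatMap { case (s, d) =>
        val p = partitionOf(s, d, numParts)
        Seq(s -> p, d -> p)
      }
      .groupBy(_._1)
      .map { case (v, ps) => v -> ps.map(_._2).toSet }
}
```

Under this scheme a vertex can only occur in one grid row (as a source) and one grid column (as a destination), so it is replicated to at most 2 × side − 1 partitions regardless of its degree.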
![Page 25: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics](https://reader036.vdocuments.net/reader036/viewer/2022062521/56816711550346895ddb7b1d/html5/thumbnails/25.jpg)
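Caching for Iterative mrTriplets
[Figure: the Vertex Table (RDD), holding A through F, ships vertex attributes to a Mirror Cache next to each Edge Table (RDD) partition. One mirror cache holds A, B, C, D; the other holds A, D, E, F. Edges: (A, B), (A, C), (C, D), (B, C), (A, E), (A, F), (E, F), (E, D).]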
![Page 26: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics](https://reader036.vdocuments.net/reader036/viewer/2022062521/56816711550346895ddb7b1d/html5/thumbnails/26.jpg)
Incremental Updates for Iterative mrTriplets
[Figure: same layout as the previous slide. Only the changed vertices (A and E) are sent from the Vertex Table (RDD) to the mirror caches, followed by a Scan of the Edge Table (RDD).]
![Page 27: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics](https://reader036.vdocuments.net/reader036/viewer/2022062521/56816711550346895ddb7b1d/html5/thumbnails/27.jpg)
Aggregation for Iterative mrTriplets
[Figure: same layout as the previous slide. After the Scan, each edge partition performs a Local Aggregate of its messages and sends the aggregated changes back to the Vertex Table (RDD).]
![Page 28: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics](https://reader036.vdocuments.net/reader036/viewer/2022062521/56816711550346895ddb7b1d/html5/thumbnails/28.jpg)
Reduction in Communication Due to Cached Updates
[Plot: Connected Components on Twitter Graph. x-axis: Iteration (0 to 16); y-axis: Network Comm. (MB), log scale from 0.1 to 10000.]
Most vertices are within 8 hops of all vertices in their comp.
![Page 29: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics](https://reader036.vdocuments.net/reader036/viewer/2022062521/56816711550346895ddb7b1d/html5/thumbnails/29.jpg)
Benefit of Indexing Active Edges
[Plot: Connected Components on Twitter Graph. x-axis: Iteration (0 to 16); y-axis: Runtime (Seconds), 0 to 30. Series: Scan (Scan All Edges) and Indexed (Index of “Active” Edges).]
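The difference between scanning all edges and indexing the active ones can be sketched with a small label-propagation version of connected components. This is an illustrative model only: the edge-visit counter stands in for runtime, every edge endpoint is assumed to be listed in `vertices`, and the names (`ActiveEdgesSketch`, `components`, `useIndex`) are hypothetical.

```scala
object ActiveEdgesSketch {
  // Label propagation for connected components on an undirected edge list.
  // Returns (labels, number of edge visits) so scan and indexed runs can be compared.
  def components(vertices: Set[Int], edges: Seq[(Int, Int)], useIndex: Boolean): (Map[Int, Int], Long) = {
    // Treat every edge as bidirectional.
    val directed = edges.flatMap { case (a, b) => Seq((a, b), (b, a)) }
    // Index from source vertex to its outgoing edges.
    val bySrc: Map[Int, Seq[(Int, Int)]] = directed.groupBy(_._1)
    var labels: Map[Int, Int] = vertices.map(v => v -> v).toMap
    var active: Set[Int] = vertices // vertices whose label changed in the last round
    var visits = 0L
    while (active.nonEmpty) {
      // Scan: touch every edge, then filter. Indexed: look up only edges of active sources.
      val candidates =
        if (useIndex) active.toSeq.flatMap(v => bySrc.getOrElse(v, Seq.empty))
        else directed
      visits += candidates.size
      // Each active source sends its label; each destination keeps the smallest one.
      val msgs = candidates
        .filter { case (s, _) => active(s) }
        .map { case (s, d) => d -> labels(s) }
        .groupBy(_._1)
        .map { case (d, ms) => d -> ms.map(_._2).min }
      val changed = msgs.filter { case (d, l) => l < labels(d) }
      labels = labels ++ changed
      active = changed.keySet
    }
    (labels, visits)
  }
}
```

Both modes compute the same labels. The indexed mode touches fewer edges in later iterations, when only a few vertices are still changing, which is the regime the plot on this slide describes.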
![Page 30: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics](https://reader036.vdocuments.net/reader036/viewer/2022062521/56816711550346895ddb7b1d/html5/thumbnails/30.jpg)
Join Elimination
Identify and bypass joins for unused triplets fields
»Example: PageRank only accesses source attribute
[Plot: PageRank on Twitter. x-axis: Iteration (0 to 20); y-axis: Communication (MB), 0 to 15000. Series: Three Way Join and Join Elimination.]
Factor of 2 reduction in communication
![Page 31: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics](https://reader036.vdocuments.net/reader036/viewer/2022062521/56816711550346895ddb7b1d/html5/thumbnails/31.jpg)
Additional Query Optimizations
Indexing and Bitmaps:
»To accelerate joins across graphs
»To efficiently construct sub-graphs
Substantial Index and Data Reuse:
»Reuse routing tables across graphs and sub-graphs
»Reuse edge adjacency information and indices
![Page 32: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics](https://reader036.vdocuments.net/reader036/viewer/2022062521/56816711550346895ddb7b1d/html5/thumbnails/32.jpg)
Performance Comparisons
Live-Journal: 69 Million Edges
Runtime (in seconds, PageRank for 10 iterations):
GraphLab: 22
GraphX: 68
Giraph: 207
Naïve Spark: 354
Mahout/Hadoop: 1340
GraphX is roughly 3x slower than GraphLab
![Page 33: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics](https://reader036.vdocuments.net/reader036/viewer/2022062521/56816711550346895ddb7b1d/html5/thumbnails/33.jpg)
GraphX scales to larger graphs
Twitter Graph: 1.5 Billion Edges
Runtime (in seconds, PageRank for 10 iterations):
GraphLab: 203
GraphX: 451
Giraph: 749
GraphX is roughly 2x slower than GraphLab
»Scala + Java overhead: Lambdas, GC time, …
»No shared memory parallelism: 2x increase in comm.
![Page 34: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics](https://reader036.vdocuments.net/reader036/viewer/2022062521/56816711550346895ddb7b1d/html5/thumbnails/34.jpg)
PageRank is just one stage….
What about a pipeline?
![Page 35: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics](https://reader036.vdocuments.net/reader036/viewer/2022062521/56816711550346895ddb7b1d/html5/thumbnails/35.jpg)
A Small Pipeline in GraphX
[Figure: Raw Wikipedia (XML) → Hyperlinks → PageRank → Top 20 Pages, run as Spark Preprocess → HDFS → Compute → HDFS → Spark Post.]
[Bar chart: Total Runtime (in Seconds), axis 0 to 1600. Systems shown include GraphLab + Spark and Giraph + Spark; bar values: 342, 1492, 605, 375.]
Timed end-to-end GraphX is faster than GraphLab
![Page 36: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics](https://reader036.vdocuments.net/reader036/viewer/2022062521/56816711550346895ddb7b1d/html5/thumbnails/36.jpg)
The GraphX Stack (Lines of Code)
Spark
GraphX (3575)
Pregel (28) + GraphLab (50)
PageRank (5), Connected Comp. (10), Shortest Path (10), SVD (40), ALS (40), K-core (51), Triangle Count (45), LDA (120)
![Page 37: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics](https://reader036.vdocuments.net/reader036/viewer/2022062521/56816711550346895ddb7b1d/html5/thumbnails/37.jpg)
Status
Alpha release as part of Spark 0.9
Seeking collaborators and feedback
![Page 38: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics](https://reader036.vdocuments.net/reader036/viewer/2022062521/56816711550346895ddb7b1d/html5/thumbnails/38.jpg)
Conclusion and Observations
Domain specific views: Tables and Graphs
»tables and graphs are first-class composable objects
»specialized operators which exploit view semantics
Single system that efficiently spans the pipeline
»minimize data movement and duplication
»eliminates need to learn and manage multiple systems
Graphs through the lens of database systems
»Graph-Parallel Pattern → Triplet joins in relational alg.
»Graph Systems → Distributed join optimizations
![Page 39: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics](https://reader036.vdocuments.net/reader036/viewer/2022062521/56816711550346895ddb7b1d/html5/thumbnails/39.jpg)
Active Research
Static Data → Dynamic Data
»Apply GraphX unified approach to time evolving data
»Model and analyze relationships over time
Serving Graph Structured Data
»Allow external systems to interact with GraphX
»Unify distributed graph databases with relational database technology
![Page 40: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics](https://reader036.vdocuments.net/reader036/viewer/2022062521/56816711550346895ddb7b1d/html5/thumbnails/40.jpg)
Thanks!
[email protected]@eecs.berkeley.edu
[email protected]@eecs.berkeley.edu
http://amplab.github.io/graphx/
![Page 41: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics](https://reader036.vdocuments.net/reader036/viewer/2022062521/56816711550346895ddb7b1d/html5/thumbnails/41.jpg)
Graph Property 1: Real-World Graphs
Power-Law Degree Distribution
[Plot: AltaVista WebGraph, 1.4B Vertices, 6.6B Edges. x-axis: Degree; y-axis: Number of Vertices. -Slope = α ≈ 2.]
Top 1% of vertices are adjacent to 50% of the edges!
More than 10^8 vertices have one neighbor.
Edges >> Vertices
[Plot: Ratio of Edges to Vertices (0 to 200) by Year (2008 to 2012).]
![Page 42: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics](https://reader036.vdocuments.net/reader036/viewer/2022062521/56816711550346895ddb7b1d/html5/thumbnails/42.jpg)
Graph Property 2: Active Vertices
[Plot: PageRank on Web Graph. x-axis: Number of Updates (0 to 70); y-axis: Num-Vertices, log scale from 1 to 100000000.]
51% updated only once!
![Page 43: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics](https://reader036.vdocuments.net/reader036/viewer/2022062521/56816711550346895ddb7b1d/html5/thumbnails/43.jpg)
Graphs are Essential to Data Mining and Machine Learning
Identify influential people and information
Find communities
Understand people’s shared interests
Model complex data dependencies
![Page 44: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics](https://reader036.vdocuments.net/reader036/viewer/2022062521/56816711550346895ddb7b1d/html5/thumbnails/44.jpg)
Recommending Products
[Figure: bipartite graph of Users and Items connected by Ratings.]
![Page 45: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics](https://reader036.vdocuments.net/reader036/viewer/2022062521/56816711550346895ddb7b1d/html5/thumbnails/45.jpg)
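Recommending Products
Low-Rank Matrix Factorization:
[Figure: the Netflix Users × Movies ratings matrix is approximated by the product of User Factors (U) and Movie Factors (M). In the equivalent bipartite graph, rating edges r13, r14, r24, r25 connect user and movie vertices carrying factor vectors f(1) through f(5).]
Iterate: [equation over f(i) and f(j)]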
![Page 46: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics](https://reader036.vdocuments.net/reader036/viewer/2022062521/56816711550346895ddb7b1d/html5/thumbnails/46.jpg)
Predicting User Behavior
[Figure: a graph of users and their Posts. A few vertices are labeled Liberal or Conservative; the remaining vertices are marked "?" and are to be inferred.]
Conditional Random Field
Belief Propagation
![Page 47: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics](https://reader036.vdocuments.net/reader036/viewer/2022062521/56816711550346895ddb7b1d/html5/thumbnails/47.jpg)
Finding Communities
Count triangles passing through each vertex:
[Figure: example graph with vertices 1, 2, 3, 4.]
Measures “cohesiveness” of local community
Fewer Triangles: Weaker Community
More Triangles: Stronger Community
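Per-vertex triangle counting can be sketched in a few lines of plain Scala by intersecting neighbor sets. This is an illustration on an in-memory edge list, not the GraphX implementation; `TriangleCountSketch` and `trianglesPerVertex` are hypothetical names, and the graph is assumed to be undirected.

```scala
object TriangleCountSketch {
  // Counts, for each vertex, the triangles that pass through it in an undirected graph.
  def trianglesPerVertex(edges: Seq[(Int, Int)]): Map[Int, Int] = {
    // Neighbor sets, ignoring self-loops and edge direction.
    val nbrs: Map[Int, Set[Int]] = edges
      .filter { case (a, b) => a != b }
      .flatMap { case (a, b) => Seq(a -> b, b -> a) }
      .groupBy(_._1)
      .map { case (v, ps) => v -> ps.map(_._2).toSet }
    nbrs.map { case (v, ns) =>
      // Every triangle through v is found once from each of its two other corners, hence / 2.
      val shared = ns.toSeq.map(u => (ns intersect nbrs(u)).size).sum
      v -> (shared / 2)
    }
  }
}
```

A vertex whose neighbors are themselves well connected gets a high count, which matches the slide's reading of triangles as a measure of local cohesiveness.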
![Page 48: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics](https://reader036.vdocuments.net/reader036/viewer/2022062521/56816711550346895ddb7b1d/html5/thumbnails/48.jpg)
Example Graph Analytics Pipeline
[Figure: Preprocessing: Raw Data (XML) → ETL → Initial Graph → Slice → Subgraph. Compute: PageRank (Repeat). Post Proc.: Analyze → Top Users.]