A Tale of Two Graph Frameworks: GraphFrames and TinkerPop
Artem Aliev and Russell Spitzer, DataStax
A Tale of Two Graph Frameworks on Spark: GraphFrames and TinkerPop OLAP
#EUeco3
Pierrot and Harlequin
• Artem • Graph Analytics Expert • Earth
• Russell • Distributed Systems Enthusiast • Earth
TinkerPop and GraphFrames provide Complementary Approaches for Graph Analytics
DataSet Catalyst
GraphFrames
Graphs are Vertices and Edges
Vertices are things and edges represent their relations to one another
Four Ship vertices, each with a Vertex Label and Vertex Properties:
• Ship • Registry: USS Enterprise (NCC-1701) • Class: Constitution class • Service: 2245–2285 (40 Years)
• Ship • Registry: USS Enterprise (NCC-1701-A) • Class: Enterprise class • Service: 2286–2293 (7 Years)
• Ship • Registry: USS Enterprise (NCC-1701-C) • Class: Ambassador • Service: 2332–2344 (12 Years)
• Ship • Registry: USS Enterprise (NCC-1701-D) • Class: Galaxy • Service: 2363–2371 (8 Years)
Two Crew vertices:
• Crew • Position: Captain • Name: Kirk
• Crew • Position: Captain • Name: Picard
Edges, each with an Edge Label:
• succeeded by (three edges between Ships)
• served on (four edges between Crew and Ships)
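A minimal sketch of the example graph as plain data, assuming nothing beyond the Python standard library. The short ids ("1701", "kirk") and the property key names are made up for illustration, and the edge endpoints are one reading of the slide diagrams, not something the slides state in text.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    id: str                                   # hypothetical short id
    label: str                                # vertex label: "Ship" or "Crew"
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    src: str                                  # id of the vertex the edge leaves
    dst: str                                  # id of the vertex the edge enters
    label: str                                # edge label


vertices = [
    Vertex("1701", "Ship", {"registry": "NCC-1701", "class": "Constitution"}),
    Vertex("1701-A", "Ship", {"registry": "NCC-1701-A", "class": "Enterprise"}),
    Vertex("1701-C", "Ship", {"registry": "NCC-1701-C", "class": "Ambassador"}),
    Vertex("1701-D", "Ship", {"registry": "NCC-1701-D", "class": "Galaxy"}),
    Vertex("kirk", "Crew", {"position": "Captain", "name": "Kirk"}),
    Vertex("picard", "Crew", {"position": "Captain", "name": "Picard"}),
]

# Endpoints are an assumption read off the diagrams.
edges = [
    Edge("1701", "1701-A", "succeeded by"),
    Edge("1701-A", "1701-C", "succeeded by"),
    Edge("1701-C", "1701-D", "succeeded by"),
    Edge("kirk", "1701", "served on"),
    Edge("kirk", "1701-A", "served on"),
    Edge("picard", "1701-C", "served on"),
    Edge("picard", "1701-D", "served on"),
]


def by_label(items, label):
    """Return the vertices or edges carrying the given label."""
    return [item for item in items if item.label == label]


ships = by_label(vertices, "Ship")
served_on = by_label(edges, "served on")
```

Labels and properties live on the elements themselves here, which is the node centric view; a relational layout would split the same data into a vertex table and an edge table.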
But why do I want this?
Graphs let us ask questions about our data based on their relations
What Captain Served After Kirk?
What Ship was two after the NCC-1701?
Traversals involve following paths through the Graph
What Captain was After Kirk?
• Crew (Position: Captain, Name: Kirk) served on Ship NCC-1701-A
• Ship NCC-1701-A succeeded by Ship NCC-1701-C
• Crew (Position: Captain, Name: Picard) served on Ship NCC-1701-C
What Ship was two after the NCC-1701?
• Ship NCC-1701 succeeded by Ship NCC-1701-A
• Ship NCC-1701-A succeeded by Ship NCC-1701-C
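The two questions can be sketched as path-following over a small edge list. This is plain Python, not Gremlin; the helper names out and in_ are only borrowed from Gremlin's out() and in() steps, and the edge endpoints are an assumption read off the diagrams.

```python
# Edges as (source, label, destination); endpoints are an assumption.
edges = [
    ("NCC-1701", "succeeded by", "NCC-1701-A"),
    ("NCC-1701-A", "succeeded by", "NCC-1701-C"),
    ("NCC-1701-C", "succeeded by", "NCC-1701-D"),
    ("Kirk", "served on", "NCC-1701"),
    ("Kirk", "served on", "NCC-1701-A"),
    ("Picard", "served on", "NCC-1701-C"),
    ("Picard", "served on", "NCC-1701-D"),
]


def out(vertex_ids, label):
    """Follow edges with this label forward from each given vertex."""
    return [d for v in vertex_ids for (s, l, d) in edges if s == v and l == label]


def in_(vertex_ids, label):
    """Follow edges with this label backward into each given vertex."""
    return [s for v in vertex_ids for (s, l, d) in edges if d == v and l == label]


# What Ship was two after the NCC-1701? Two hops along "succeeded by".
two_after = out(out(["NCC-1701"], "succeeded by"), "succeeded by")

# What Captain Served After Kirk? Kirk's ships, their successors,
# then whoever served on those successors (dropping Kirk himself).
kirk_ships = out(["Kirk"], "served on")
next_ships = out(kirk_ships, "succeeded by")
after_kirk = sorted(set(in_(next_ships, "served on")) - {"Kirk"})
```

Each hop only needs the edges touching the current set of vertices, which is why a traversal is described as following a path rather than scanning the whole graph.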
TinkerPop is a Powerful and Flexible Graph Framework
• Server, Language, Connectors
• Graph Framework for OLAP and OLTP
• Node Centric Representations
• Fluent API (Gremlin)
• Fully Self Contained Framework
OLTP Examples
MovieLens Example Schema
https://grouplens.org/datasets/movielens/
What happens when you have too much data?
TinkerPop Spark OLAP Mechanism
• Instead of one traversal we traverse starting from all nodes simultaneously
Distribution Requires Partitioning
Big Data -> ? -> Independent Chunks of Data
Vertex Stored in a PairRDD: Id -> StarVertex(Edge and Property Information)
Vertices: 1, A, C, D
Star Vertex: Adjacency list representation (Just Id Of Connected Vertex)
1: "A", "Kirk"
A: "C", "Kirk"
C: "D", "Picard"
D: "Picard"
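A rough sketch of the idea, assuming plain Python in place of Spark: each vertex id maps to its adjacency list, and the (id, star) pairs are split into independent chunks by hashing the id. The crc32-based partition function is a stand-in chosen so the result is deterministic; it is not what Spark or TinkerPop actually use.

```python
import zlib

# Adjacency list copied from the slide: vertex id -> ids of connected vertices.
star_vertices = {
    "1": ["A", "Kirk"],
    "A": ["C", "Kirk"],
    "C": ["D", "Picard"],
    "D": ["Picard"],
}


def partition_of(vertex_id, num_partitions):
    """Pick a partition from the vertex id alone (hypothetical hash choice)."""
    return zlib.crc32(vertex_id.encode("utf-8")) % num_partitions


def partition(pairs, num_partitions):
    """Split (id, star) pairs into independent chunks keyed by id."""
    chunks = [[] for _ in range(num_partitions)]
    for vertex_id, star in pairs:
        chunks[partition_of(vertex_id, num_partitions)].append((vertex_id, star))
    return chunks


chunks = partition(star_vertices.items(), num_partitions=2)
```

Because a star vertex carries its own edges, a chunk can be processed without looking at any other chunk; the cost is that a very high degree vertex makes one pair very large.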
Vertex Program Runs Initializing Traverser for every Vertex
Vertices: 1, A, C, D
SparkMemory - Accumulator - Used for Global State
Then we cycle through a Message Passing Algorithm
• Passes messages from one Vertex to another with a join
• Repeat
• All Traversers Halt or Program Terminates
• Result!
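The cycle can be sketched in a few lines of plain Python, assuming a single process instead of Spark. The vertex ids follow the slide; the edge directions, the function name run_hops, and the supersteps counter standing in for accumulator-backed global state are all assumptions made for illustration.

```python
from collections import Counter

# Out-edges per vertex; directions are an assumption (ship succession chain).
out_edges = {"1": ["A"], "A": ["C"], "C": ["D"], "D": []}


def run_hops(out_edges, hops):
    """Move traversers along out-edges for at most `hops` supersteps."""
    # Initialization: one traverser sitting on every vertex.
    traversers = Counter({vertex: 1 for vertex in out_edges})
    supersteps = 0  # stand-in for global state kept in an accumulator
    for _ in range(hops):
        if not traversers:
            break  # all traversers have halted
        messages = Counter()
        for vertex, count in traversers.items():
            for neighbor in out_edges[vertex]:
                messages[neighbor] += count  # send traversers to the neighbor
        traversers = messages  # delivered messages become the new state
        supersteps += 1
    return traversers, supersteps


traversers, supersteps = run_hops(out_edges, hops=2)
total = sum(traversers.values())
```

With two hops from every vertex, only the traversers that started on 1 and A survive, ending on C and D. In the distributed version the delivery step is where the join, and therefore the shuffle, happens.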
Example OLAP Traversals
TinkerPop Spark OLAP Pros/Cons
Pros
• Every message pass requires only a single shuffle
• Edges and edge properties accessible without a step
• Very Flexible, Many Provider Specific Shortcuts possible
• Internal properties can be any Java type
• All in one, Server already ready for multiple clients
Cons
• Limited in ability to connect to external sources/other Spark applications
• Flexibility of framework allows for many platform specific shortcuts to be added
• Genericness provides difficulty in making some optimizations
• Edges co-partitioned with vertices, high degree nodes can cause memory issues
GraphFrames Background
• Third Party Package
• https://graphframes.github.io/
• Integrates with Dataset/DataFrame in Spark
• Relational under the hood
GraphFrames are built of two DataFrames
Each DataFrame is made of Rows and Columns
Vertex DataFrame
id      job              species
Geordi  Chief Engineer   Human
Data    Science Officer  Android
Edge DataFrame
src     dst   relationship
Geordi  Data  Friend
• Can Only Be Spark Types
• No Built in Labels
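A small sketch of the two-table layout, using lists of dicts in place of Spark DataFrames. GraphFrames expects an id column on the vertex DataFrame and src and dst columns on the edge DataFrame; everything else here (the variable names, the integrity check) is illustrative only.

```python
# Vertex table: one row per vertex, keyed by "id".
vertices = [
    {"id": "Geordi", "job": "Chief Engineer", "species": "Human"},
    {"id": "Data", "job": "Science Officer", "species": "Android"},
]

# Edge table: one row per edge, pointing at vertex ids through "src" and "dst".
edges = [
    {"src": "Geordi", "dst": "Data", "relationship": "Friend"},
]

# Edges whose endpoints are missing from the vertex table.
vertex_ids = {row["id"] for row in vertices}
dangling = [e for e in edges
            if e["src"] not in vertex_ids or e["dst"] not in vertex_ids]

# With no built-in labels, an ordinary column such as "relationship"
# plays the role of an edge label and is filtered like any other column.
friends = [(e["src"], e["dst"]) for e in edges if e["relationship"] == "Friend"]
```

Since a label is just another column, filtering by it is an ordinary relational filter rather than a special graph operation.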
Catalyst Optimizes any Requests
• Simple requests using DataFrame API don't do anything special
• Some methods fall back to GraphX (RDD Based)
• Others use pure DataFrame methods
GraphFrames Motif Matching
GraphFrame motif: (a)-[e]->(b)
• Vertex (a): Vertices as a UDT "A"
  A: <VertexRow>
• Edge [e]: Edges as a UDT "E", Join with edges where A.id = E.src
  A: <VertexRow>, E: <EdgeRow>
• Vertex (b): Vertices as a UDT "B", Join with vertices where E.dst = B.id
  A: <VertexRow>, E: <EdgeRow>, B: <VertexRow>
THAT'S SO MANY JOINS
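The two joins behind a single (a)-[e]->(b) match can be sketched with plain Python dicts standing in for DataFrames. In GraphFrames itself the pattern is passed as a string to find; the function find_motif below is a hypothetical stand-in that only mimics the join conditions named on the slide.

```python
vertices = [
    {"id": "Geordi", "job": "Chief Engineer", "species": "Human"},
    {"id": "Data", "job": "Science Officer", "species": "Android"},
]
edges = [
    {"src": "Geordi", "dst": "Data", "relationship": "Friend"},
]


def find_motif(vertices, edges):
    """Match (a)-[e]->(b) with two lookups: A.id = E.src, then E.dst = B.id."""
    by_id = {v["id"]: v for v in vertices}  # hash side of both joins
    matches = []
    for e in edges:
        a = by_id.get(e["src"])  # join 1: A.id = E.src
        b = by_id.get(e["dst"])  # join 2: E.dst = B.id
        if a is not None and b is not None:
            # Each result row nests whole rows under the names a, e, b.
            matches.append({"a": a, "e": e, "b": b})
    return matches


matches = find_motif(vertices, edges)
```

Every extra hop in a motif adds another edge join and another vertex join, so the result rows get wider and the plan gets deeper as paths get longer.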
DataFrames means Optimizations are Automatic
Vertex, Edge, Vertex
A: <VertexRow>
A: <VertexRow>, E: <EdgeRow>
A: <VertexRow>, E: <EdgeRow>, B: <VertexRow>
Select A.ID
Columns Pruned and Predicates Pushed
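What pruning and pushdown buy can be shown by doing them by hand on a toy query: select A.id for (a)-[e]->(b) where b is an Android. This is a manual illustration in plain Python, not Catalyst; the second edge row and the species predicate are hypothetical additions to the slide's example.

```python
vertices = [
    {"id": "Geordi", "job": "Chief Engineer", "species": "Human"},
    {"id": "Data", "job": "Science Officer", "species": "Android"},
]
edges = [
    {"src": "Geordi", "dst": "Data", "relationship": "Friend"},
    {"src": "Data", "dst": "Geordi", "relationship": "Friend"},  # hypothetical row
]


def naive(vertices, edges):
    """Join full rows first, then filter, then project A.id."""
    by_id = {v["id"]: v for v in vertices}
    joined = [{"a": by_id[e["src"]], "e": e, "b": by_id[e["dst"]]}
              for e in edges if e["src"] in by_id and e["dst"] in by_id]
    filtered = [row for row in joined if row["b"]["species"] == "Android"]
    return [row["a"]["id"] for row in filtered]


def optimized(vertices, edges):
    """Filter and drop columns before joining."""
    a_ids = {v["id"] for v in vertices}  # A pruned to its id column
    # Predicate pushed down to the B side, then pruned to id.
    b_ids = {v["id"] for v in vertices if v["species"] == "Android"}
    slim_edges = [(e["src"], e["dst"]) for e in edges]  # E pruned to src, dst
    return [src for src, dst in slim_edges if src in a_ids and dst in b_ids]


naive_result = naive(vertices, edges)
optimized_result = optimized(vertices, edges)
```

Both functions return the same ids; the second simply never carries job, species, or relationship through the joins, which is the kind of rewrite an optimizer can apply without the query author asking for it.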
All of the normal optimizations happen within this Framework
• Broadcast? (asked at each of the two joins)
Code Generation and Internal Rows
• Code Generation
GraphFrames Examples
GraphFrames Pros/Cons
Pros
• Much Faster on basic counts
• Powerful optimizations + CodeGen
• Easy to connect to other sources
Cons
• Slower on complex traversals (2 Joins per hop)
• Relational Model not as Flexible
Choosing the Right Framework
Choose TinkerPop OLAP For Long Paths
• More complicated queries
• Traversals that require many hops
  • g.V().out().out().out().out()
• Avoid for simple counts and aggregations
• Avoid if you have very high degree Vertices
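To make the long-path cost concrete, here is a sketch that writes a k-hop path in GraphFrames motif syntax and counts joins using the "2 Joins per hop" figure from the GraphFrames cons list. The helper names and the v0, e0 naming scheme are hypothetical; only the "(a)-[e]->(b); (b)-[e2]->(c)" pattern style comes from GraphFrames.

```python
def k_hop_motif(hops):
    """Build a motif string describing a path of `hops` consecutive edges."""
    if hops < 1:
        raise ValueError("a path needs at least one hop")
    parts = [f"(v{i})-[e{i}]->(v{i + 1})" for i in range(hops)]
    return "; ".join(parts)


def joins_needed(hops, joins_per_hop=2):
    """Joins implied by a path, using the talk's 2-joins-per-hop figure."""
    return hops * joins_per_hop


# The four-hop traversal written as a motif.
motif = k_hop_motif(4)
joins = joins_needed(4)
```

Four hops already mean eight joins on the relational side, while the message passing approach needs one shuffle per hop; that gap is the reason long paths lean toward TinkerPop OLAP.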
Choose GraphFrames for Interoperability and Short Paths
• General Edge/Vertex stats: groupCount, min, max
• Connecting to other sources
• Short paths
• High Degree Vertices
• Avoid
  • Long path algorithms
Choosing the Right Framework
Gremlin on GraphFrames
OLTP backed by DSE Graph
Built in Spark
We write it!
Search Built In!
Advanced Security
Thanks for Listening
• DataStax Academy Graph Course: https://academy.datastax.com/resources/ds330-datastax-enterprise-graph
• Try out DataStax Enterprise! https://academy.datastax.com/quick-downloads
• Apache TinkerPop: http://tinkerpop.apache.org/
• GraphFrames: https://graphframes.github.io/