graphframes: graph queries in apache spark sqllierranli/publications/graphframes-spark... ·...

20
GraphFrames: Graph Queries in Apache Spark SQL Ankur Dave UC Berkeley AMPLab Joint work with Alekh Jindal (Microsoft), Li Erran Li (Uber), Reynold Xin (Databricks), Joseph Gonzalez (UC Berkeley), and Matei Zaharia (MIT and Databricks)

Upload: trinhque

Post on 26-May-2018

231 views

Category:

Documents


0 download

TRANSCRIPT

GraphFrames: Graph Queries in Apache Spark SQL

Ankur DaveUC Berkeley AMPLab

Joint work with Alekh Jindal (Microsoft), Li Erran Li (Uber), Reynold Xin (Databricks), Joseph Gonzalez (UC Berkeley), and Matei Zaharia (MIT and Databricks)

+ Graph Queries

2016Apache Spark + GraphFrames

GraphFrames (2016)

+ Graph Algorithms

2013Apache Spark + GraphX

Relational Queries

2009Spark

Graph Algorithms vs. Graph Queries

≈x

PageRank

Alternating Least Squares

Graph Algorithms Graph Queries

Graph Algorithms vs. Graph QueriesGraph Algorithm: PageRank Graph Query: Wikipedia Collaborators

Editor 1 Editor 2 Article 1 Article 2

Article 1

Article 2

Editor 1

Editor 2

same day} same day}

Graph Algorithms vs. Graph QueriesGraph Algorithm: PageRank

// Iterate until convergence wikipedia.pregel(sendMsg = { e =>

e.sendToDst(e.srcRank * e.weight)},mergeMsg = _ + _,vprog = { (id, oldRank, msgSum) =>

0.15 + 0.85 * msgSum})

Graph Query: Wikipedia Collaboratorswikipedia.find("(u1)-[e11]->(article1);(u2)-[e21]->(article1);(u1)-[e12]->(article2);(u2)-[e22]->(article2)")

.select("*","e11.date – e21.date".as("d1"),"e12.date – e22.date".as("d2"))

.sort("d1 + d2".desc).take(10)

Separate SystemsGraph Algorithms Graph Queries

Raw Wikipedia

< / >< / >< / >XML

Text Table

Edit GraphEdit Table

Frequent Collaborators

Problem: Mixed Graph AnalysisHyperlinks PageRank

Article Text

User Article

Vandalism Suspects

User User

User Article

Solution: GraphFrames

Graph Algorithms Graph Queries

Spark SQL

GraphFrames API

Pattern Query Optimizer

GraphFrames API• Unifies graph algorithms, graph queries, and DataFrames• Available in Scala, Java, and Python

class GraphFrame {def vertices: DataFramedef edges: DataFrame

def find(pattern: String): DataFramedef registerView(pattern: String, df: DataFrame): Unit

def degrees(): DataFramedef pageRank(): GraphFramedef connectedComponents(): GraphFrame...

}

Implementation

Parsed Pattern

Logical Plan

Materialized Views

Optimized Logical Plan

DataFrameResult

Query String

Graph–RelationalTranslation Join Elimination

and Reordering

Spark SQL

View SelectionGraph

AlgorithmsGraphX

Graph–Relational Translation

B

D

A

C

Existing Logical PlanOutput: A,B,C

Src Dst

⋈C=Src

Edge Table

ID Attr

Vertex Table

⋈D=ID

Materialized View Selection

GraphX: Triplet view enabled efficient message-passing algorithms

Vertices

B

A

C

D

Edges

A B

A C

B C

C D

A

B

Triplet View

A C

B C

C D

Graph

+

UpdatedPageRanks

B

A

C

D

A

Materialized View Selection

GraphFrames: User-defined views enable efficient graph queries

Vertices

B

A

C

D

Edges

A B

A C

B C

C D

A

B

Triplet View

A CB CC D

Graph

User-Defined Views

PageRank

CommunityDetection

Graph Queries

Join Elimination

Src Dst1 21 32 32 5

Edges

ID Attr1 A2 B3 C4 D

VerticesSELECT src, dstFROM edges INNER JOIN vertices ON src = id;

Unnecessary join

can be eliminated if tables satisfy referential integrity, simplifying graph–relational translation:

SELECT src, dst FROM edges;

Join Reordering

A → B B → A

⋈A, BB → D

C→ B⋈BB → E⋈B

C→ D⋈BC→ E⋈C, D

⋈C, EExample Query

Left-Deep Plan Bushy Plan

A → B B → A

⋈A, B

B → D C→ B

⋈B

B → E⋈B⋈B⋈B, C

User-Defined View

EvaluationFaster than Neo4j for unanchored pattern queries

0

0.5

1

1.5

2

2.5

GraphFrames Neo4j

Que

ry la

tenc

y, s

Anchored Pattern Query

01020304050607080

GraphFrames Neo4j

Que

ry la

tenc

y, s

Unanchored Pattern Query

Triangle query on 1M edge subgraph of web-Google. Each system configured to use a single core.

EvaluationApproaches performance of GraphX for graph algorithms using Spark SQL whole-stage code generation

0

1

2

3

4

5

6

7

GraphFrames GraphX Naïve Spark

Per-i

tera

tion

runt

ime,

s

PageRank Performance

Per-iteration performance on web-Google, single 8-core machine. Naïve Spark uses Scala RDD API.

EvaluationRegistering the right views can greatly improve performance for some queries

Workload: J. Huang, K. Venkatraman, and D.J. Abadi. Query optimization of distributed pattern matching. In ICDE 2014.

Future Work• Suggest views automatically• Exploit attribute-based partitioning in optimizer• Code generation for single node

Try It Out!Released as a Spark Package at:

https://github.com/graphframes/graphframesThanks to Joseph Bradley, Xiangrui Meng, and Timothy Hunter.

[email protected]