Latest Trends in Distributed Graph Processing with Spark GraphFrames and openCypher

Nagato Kasaki (加嵜長門), Big Data Department

March 8, 2016

About Me

• Nagato Kasaki (加嵜長門)
• At DMM.com Labo since April 2014
  • Built the Hadoop platform
  • Developed recommendation systems with Spark MLlib and GraphX
• Favorite languages
  • SQL
  • Cypher

What Is GraphFrames?

• GraphFrames: http://graphframes.github.io/
• An Apache Spark package for distributed graph processing
• Integrates Spark GraphX with DataFrames (Spark SQL)
• Released by Databricks on March 3, 2016


• Spark Summit East 2016 (February 18, 2016)
  https://spark-summit.org/east-2016/events/graphframes-graph-queries-in-spark-sql/


• Spark package (February 25, 2016)
  http://spark-packages.org/package/graphframes/graphframes


• Introducing GraphFrames (March 3, 2016)
  https://databricks.com/blog/2016/03/03/introducing-graphframes.html

Key Features of GraphFrames

• Graph queries with openCypher
• Graph processing with Pregel
• Distributed processing





Graph Queries with openCypher

• Graph analysis and graph queries

Source: http://www.slideshare.net/SparkSummit/graphframes-graph-queries-in-spark-sql-by-ankur-dave

openCypher

• An open-source graph query language
  • Derived from Neo4j's Cypher
• Allows declarative queries similar to SQL

    MATCH (cypher:QueryLanguage)-[:QUERIES]->(graphs)
    MATCH (cypher)<-[:USES]-(u:User)
    WHERE u.name IN ['Oracle', 'Apache Spark', 'Tableau', 'Structr']
    MATCH (openCypher)-[:MAKES_AVAILABLE]->(cypher)
    RETURN cypher.attributes

    -----------
    ['awesome', …]

http://www.opencypher.org/

Trying GraphFrames

• Usage
  • APIs are available for Scala, Java, and Python
• Installation
  • Try it interactively in the Spark shell
  • Use build.sbt


• Try it interactively in the Spark shell
  • Supports Spark 1.4 and later
  • The latest version is recommended to take full advantage of DataFrames

    # Download Spark
    $ wget http://ftp.jaist.ac.jp/pub/apache/spark/spark-1.6.0/spark-1.6.0-bin-hadoop2.6.tgz
    $ tar xzvf spark-1.6.0-bin-hadoop2.6.tgz

    # Launch spark-shell with the graphframes package
    $ spark-1.6.0-bin-hadoop2.6/bin/spark-shell --packages graphframes:graphframes:0.1.0-spark1.6


• Use build.sbt

    resolvers += "Spark Packages Repo" at "http://dl.bintray.com/spark-packages/maven"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "1.6.0",
      "org.apache.spark" %% "spark-sql" % "1.6.0",
      "org.apache.spark" %% "spark-graphx" % "1.6.0",
      "graphframes" % "graphframes" % "0.1.0-spark1.6"
    )

GraphFrames: An Item Recommendation Example

    // Import the graphframes package
    scala> import org.graphframes._
    import org.graphframes._

    // Create the DataFrame of vertices
    scala> val v = sqlContext.createDataFrame(List(
         |   (0L, "user", "u1"),
         |   (1L, "user", "u2"),
         |   (2L, "item", "i1"),
         |   (3L, "item", "i2"),
         |   (4L, "item", "i3"),
         |   (5L, "item", "i4")
         | )).toDF("id", "type", "name")
    v: org.apache.spark.sql.DataFrame = [id: bigint, type: string, name: string]

[Figure: users u1 and u2; items i1, i2, i3, and i4]


    // Create the DataFrame of edges
    scala> val e = sqlContext.createDataFrame(List(
         |   (0L, 2L, "purchase"),
         |   (0L, 3L, "purchase"),
         |   (0L, 4L, "purchase"),
         |   (1L, 3L, "purchase"),
         |   (1L, 4L, "purchase"),
         |   (1L, 5L, "purchase")
         | )).toDF("src", "dst", "type")
    e: org.apache.spark.sql.DataFrame = [src: bigint, dst: bigint, type: string]

    // Create the GraphFrame
    scala> val g = GraphFrame(v, e)
    g: org.graphframes.GraphFrame = GraphFrame(v:[id: bigint, type: string, name: string], e:[src: bigint, dst: bigint, type: string])

[Figure: purchase log; u1 purchased i1, i2, and i3, and u2 purchased i2, i3, and i4]


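    // Example query for recommended items
    scala> g.find(
         |   " (user1)-[]->(item1); (user2)-[]->(item1);" +
         |   " (user2)-[]->(item2); !(user1)-[]->(item2)"
         | ).groupBy(
         |   "user1.name", "item2.name"
         | ).count().show()
    +----+----+-----+
    |name|name|count|
    +----+----+-----+
    |  u1|  i4|    2|
    |  u2|  i1|    2|
    +----+----+-----+

[Figure: find users who purchased items in common, then recommend the items that the target user has not purchased yet]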
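To see why each recommendation gets a count of 2, the motif can be replayed on plain Scala collections. This is only a sketch of the matching logic under the assumption that the purchase log is the six edges from the example. It does not use Spark or GraphFrames, and the object and value names are invented for illustration.

```scala
object RecommendSketch {
  // Purchase log from the example, written as (user, item) pairs.
  val purchases: Seq[(String, String)] = Seq(
    "u1" -> "i1", "u1" -> "i2", "u1" -> "i3",
    "u2" -> "i2", "u2" -> "i3", "u2" -> "i4"
  )
  val purchased: Set[(String, String)] = purchases.toSet

  // Enumerate every match of the motif
  //   (user1)-[]->(item1); (user2)-[]->(item1);
  //   (user2)-[]->(item2); !(user1)-[]->(item2)
  val matches: Seq[(String, String)] = for {
    (user1, item1) <- purchases
    (user2, shared) <- purchases if shared == item1
    (buyer, item2) <- purchases if buyer == user2
    if !purchased.contains(user1 -> item2)
  } yield (user1, item2)

  // Plain-collection counterpart of groupBy("user1.name", "item2.name").count()
  val recommendations: Map[(String, String), Int] =
    matches.groupBy(identity).map { case (pair, hits) => pair -> hits.size }
}
```

The count is the number of shared items that link the two users: u1 and u2 have i2 and i3 in common, so (u1, i4) and (u2, i1) are each matched twice.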





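BSP, Pregel, Graph

[Figure: relationship diagram connecting BSP, Pregel, Apache Hama, Open Graph, Graph Search, and Knowledge Graph; the arrows are labeled "graph-specialized", "developed", "implemented", "used", "inherited", and "influenced"]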

Bulk Synchronous Parallel (BSP)

[Figure: one superstep consists of three phases: concurrent computation, communication, and barrier synchronisation]

Question: How do we measure the distance between A and C with Pregel?

[Figure: a chain A, B, C with edge weight ab between A and B and edge weight bc between B and C; the distance between A and C is ab + bc]

A1. Iter=1, send message

[Figure: A sends the value ab to B]

A1. Iter=1, vertex program

[Figure: B stores the received value ab]

A1. Iter=2, send message

[Figure: B sends the value ab + bc to C]

A1. Iter=2, vertex program

[Figure: C stores the received value ab + bc, which is the distance between A and C]
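The two iterations above can be replayed with a small superstep loop in plain Scala. This is a sketch under assumptions, not the GraphX implementation: the edge weights 2.0 and 3.0 stand in for ab and bc, and all names are invented for illustration. The three phases correspond to the sendMsg, mergeMsg, and vprog functions of the GraphX Pregel API.

```scala
object PregelSketch {
  // Weighted edges of the chain A -> B -> C (ab = 2.0 and bc = 3.0 are made-up values).
  val edges: Seq[(String, String, Double)] = Seq(("A", "B", 2.0), ("B", "C", 3.0))

  // Distance from A known at each vertex; unknown distances start at infinity.
  val initial: Map[String, Double] =
    Map("A" -> 0.0, "B" -> Double.PositiveInfinity, "C" -> Double.PositiveInfinity)

  // One superstep: send messages, merge them per destination, run the vertex program.
  def superstep(dist: Map[String, Double]): Map[String, Double] = {
    // Send message: every vertex with a known distance offers "distance + weight".
    val messages: Seq[(String, Double)] = edges.collect {
      case (src, dst, w) if dist(src) != Double.PositiveInfinity => dst -> (dist(src) + w)
    }
    // Merge message: keep the smallest offer per destination (the barrier is implicit here).
    val inbox: Map[String, Double] =
      messages.groupBy(_._1).map { case (dst, ms) => dst -> ms.map(_._2).min }
    // Vertex program: keep the shorter of the current value and the received one.
    dist.map { case (v, d) => v -> math.min(d, inbox.getOrElse(v, Double.PositiveInfinity)) }
  }

  val afterIter1: Map[String, Double] = superstep(initial)    // B learns ab = 2.0
  val afterIter2: Map[String, Double] = superstep(afterIter1) // C learns ab + bc = 5.0
}
```

After the first superstep only B holds a finite value, and C receives ab + bc in the second, matching the slides.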

GraphX Pregel API






Data Structures in GraphFrames (GraphX)

• Distributed graph

http://spark.apache.org/docs/latest/graphx-programming-guide.html

Partition Strategy

• Degree: 10,000
• Number of partitions: 100

[Figure: vertex Vn is connected to V1, V2, ..., V10000; how should these edges be assigned to Partition 1 through Partition 100?]

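Partition Strategy: RandomVertexCut

• The partition is chosen by Hash(src, dst)
• Each partition receives 100 of Vn's edges on average
• Poor I/O efficiency, because the edges of Vn are spread across all partitions

[Figure: the edges (Vn, V1), (Vn, V2), ..., (Vn, V10000) are scattered over Partition 1 through Partition 100]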

Partition Strategy: EdgePartition1D

• The partition is chosen by Hash(src)
• The partition is determined by src alone, so the partitions where I/O occurs can be limited
• This only helps in the forward direction of the edges

[Figure: the edges (Vn, V1), (Vn, V2), ..., (Vn, V10000) share the source Vn and are all placed in the same partition]
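The difference between the two strategies can be checked by counting how many partitions hold at least one edge of Vn. The sketch below is a simplified model in plain Scala, not the GraphX code: the two hash functions are toy stand-ins chosen so that the result is easy to verify by hand, Vn is given the hypothetical id 0, and all names are invented for illustration.

```scala
object PartitionSketch1D {
  val numParts = 100

  // Vn has id 0 and one outgoing edge to each of V1 ... V10000.
  val vn = 0L
  val edges: Seq[(Long, Long)] = (1L to 10000L).map(dst => (vn, dst))

  // Toy stand-in for Hash(src, dst): mixes both endpoints (not the GraphX hash).
  def randomVertexCut(src: Long, dst: Long): Int = ((src * 31 + dst) % numParts).toInt

  // Toy stand-in for Hash(src): only the source vertex decides the partition.
  def edgePartition1D(src: Long, dst: Long): Int = (src % numParts).toInt

  // Number of distinct partitions that hold at least one edge of Vn.
  def partitionsTouched(strategy: (Long, Long) => Int): Int =
    edges.map { case (src, dst) => strategy(src, dst) }.distinct.size

  val touchedRandom: Int = partitionsTouched(randomVertexCut) // 100: every partition
  val touched1D: Int = partitionsTouched(edgePartition1D)     // 1: a single partition
}
```

Reading the outgoing edges of Vn therefore touches all 100 partitions in the first case and one partition in the second. If Vn were the destination instead of the source, the 1D stand-in would no longer keep its edges together, which is the limitation noted on the slide.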

Partition Strategy: EdgePartition2D

• The 100 partitions are arranged as a 10 × 10 grid indexed by src and dst
• The edges of Vn in one direction fall into 10 of 100 partitions
• Counting both directions, at most 20 of 100 partitions (20%) are involved
• With 10,000 partitions, at most 200 (2%) are involved

[Figure: grid of partitions with Vn, V1, V2, ..., V10000 along both axes; the edges of Vn occupy one row and one column]

GraphFrames vs. Neo4j


GraphFrames × Spark 2.0

Source: http://www.slideshare.net/databricks/2016-spark-summit-east-keynote-matei-zaharia

